
Xcell Journal
Issue 58, Third Quarter 2006
THE AUTHORITATIVE JOURNAL FOR PROGRAMMABLE LOGIC USERS
Xilinx, Inc.

ESL Expands the Reach of FPGAs

INSIDE

ESL Tools Make FPGAs Nearly Invisible to Designers

Evaluating Hardware Acceleration Strategies Using C-to-Hardware Tools

The Benefits of FPGA Coprocessing

Accelerating System Performance Using ESL Design Tools and FPGAs

Tune Multicore Hardware for Software

www.xilinx.com/xcell/

Enabling success from the center of technology™

1 800 332 8638
em.avnet.com

© Avnet, Inc. 2006. All rights reserved. AVNET is a registered trademark of Avnet, Inc.

The Xilinx® Spartan™-3 and Virtex™-4 FX12 Mini-Modules are off-the-shelf, fully configurable, complete system-on-module (SOM) embedded solutions capable of supporting MicroBlaze™ or PowerPC® embedded processors from Xilinx.

Offered as stand-alone modules and as development kits with companion baseboards, power supplies, software and reference designs, these modules can be easily integrated into custom applications and can facilitate design migration.

Gain hands-on experience with these development kits and other Xilinx tools by participating in an Avnet SpeedWay Design Workshop™ this spring.

Learn more about these mini-modules and upcoming SpeedWay Design Workshops at:
www.em.avnet.com/spartan3mini
www.em.avnet.com/virtex4mini
www.em.avnet.com/speedway

Support Across The Board.™

Jump Start Your Designs with Mini-Module Development Platforms

Mini-Module Features

• Small footprint (30 mm x 65.5 mm)

• Complete system-on-module solution with SRAM, Flash, Ethernet port and configuration memory

• Supports MicroBlaze™ or PowerPC® processors

• Based on the Virtex™-4 FX12 and 400K Gate Spartan™-3 FPGAs

• 76 User I/O

Baseboard Features

• Supports mini-module family

• Module socket allows easy connection

• Provides 3.3 V, 2.5 V and 1.2 V

• USB and RS232 ports

• 2 x 16 character LCD

Upcoming SpeedWay Design Workshops

• Developing with MicroBlaze

• Developing with PowerPC

• Xilinx Embedded Software Development

• Embedded Networking with Xilinx FPGAs

LETTER FROM THE PUBLISHER

A Shortcut to Success

This issue of the Xcell Journal features a very cohesive collection of articles about electronic system level (ESL) design. ESL is an umbrella term for tools and methods that allow designers with software programming skills to easily implement their ideas in programmable hardware (like FPGAs) without having to learn traditional hardware design techniques. The proliferation of these tools will make it easier for designers to use programmable devices for algorithm acceleration, high-performance computing, high-speed packet processing, and rapid prototyping.

In an effort to organize those vendors developing ESL products, in March 2006 Xilinx launched the ESL Initiative, inviting companies such as Celoxica and Impulse Accelerated Technologies to optimize support for Xilinx® embedded, DSP, and logic platforms and to establish common standards for ESL tool interoperability, among other goals.

Xilinx also set its own standards for participation in the ESL Initiative. To qualify as a partner, a company's ESL methodologies must be at a higher level of abstraction than RTL. They must also demonstrate a working flow for FPGAs and position their tools as an FPGA solution rather than an ASIC solution that also happens to work with FPGAs. Participants were additionally invited to write an article for this issue.

We also invited FPGA Journal Editor Kevin Morris to share his thoughts about ESL, and as you'll see he offers quite an interesting perspective. Depending on who you talk to, ESL is either all hype or the real deal. At Xilinx, we've committed to this promising technology – working with a variety of vendors (12 and growing) whose solutions are designed for a variety of applications.

I invite you to learn more about ESL and particularly each company's respective products by delving further into this issue of the Xcell Journal, or by visiting www.xilinx.com/esl.

Forrest Couch
Publisher

Xilinx, Inc.
2100 Logic Drive
San Jose, CA 95124-3400
Phone: 408-559-7778
FAX: 408-879-4780
www.xilinx.com/xcell/

© 2006 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands included herein are trademarks of Xilinx, Inc. PowerPC is a trademark of IBM, Inc. All other trademarks are the property of their respective owners.

The articles, information, and other materials included in this issue are provided solely for the convenience of our readers. Xilinx makes no warranties, express, implied, statutory, or otherwise, and accepts no liability with respect to any such articles, information, or other materials or their use, and any use thereof is solely at the risk of the user. Any person or entity using such information in any way releases and waives any claim it might have against Xilinx for any loss, damage, or expense caused thereby.

PUBLISHER Forrest Couch, [email protected]

EDITOR Charmaine Cooper Hussain

ART DIRECTOR Scott Blair

ADVERTISING SALES Dan Teie, 1-800-493-5551

TECHNICAL COORDINATOR Milan Saini

INTERNATIONAL
Dickson Seow, Asia, [email protected]
Andrea Barnard, Europe/Middle East/Africa, [email protected]
Yumi Homura, [email protected]

SUBSCRIPTIONS All Inquiries: www.xcellpublications.com

REPRINT ORDERS 1-800-493-5551

Xcell journal

www.xilinx.com/xcell/

ON THE COVER

Viewpoint, page 6
ESL Tools Make FPGAs Nearly Invisible to Designers – You can increase your productivity through ESL methodologies.

ESL, page 16
Evaluating Hardware Acceleration Strategies Using C-to-Hardware Tools – Software-based methods enable iterative design and optimization for performance-critical applications.

ESL, page 29
The Benefits of FPGA Coprocessing – The Xilinx ESL Initiative brings the horsepower of an FPGA coprocessor closer to the reach of traditional DSP system designers.

ESL, page 41
Accelerating System Performance Using ESL Design Tools and FPGAs – ESL design tools provide the keys for opening more applications to algorithm acceleration.

ESL, page 55
Tune Multicore Hardware for Software – Teja FP and Xilinx FPGAs give you more control over hardware configuration.

THIRD QUARTER 2006, ISSUE 58
Xcell journal

VIEWPOINTS

ESL Tools Make FPGAs Nearly Invisible to Designers ................................................................6

Join the “Cool” Club ...........................................................................................................9

ESL

Bringing Imaging to the System Level with PixelStreams .......................................................12

Evaluating Hardware Acceleration Strategies Using C-to-Hardware Tools ....................................16

Accelerating Architecture Exploration for FPGA Selection and System Design .............................19

Rapid Development of Control Systems with Floating Point ....................................................22

Hardware/Software Partitioning from Application Binaries......................................................26

The Benefits of FPGA Coprocessing .....................................................................................29

Using SystemC and SystemCrafter to Implement Data Encryption............................................32

Scalable Cluster-Based FPGA HPC System Solutions...............................................................35

Accelerate Image Applications with Poseidon ESL Tools ..........................................................38

Accelerating System Performance Using ESL Design Tools and FPGAs .......................................41

Transforming Software to Silicon.........................................................................................46

A Novel Processor Architecture for FPGA Supercomputing........................................................49

Turbocharging Your CPU with an FPGA-Programmable Coprocessor ...........................................52

Tune Multicore Hardware for Software.................................................................................55

Automatically Identifying and Creating Accelerators Directly from C Code ..................................58

GENERAL

ExploreAhead Extends the PlanAhead Performance Advantage ................................................62

Designing Flexible, High-Performance Embedded Systems ......................................................66

Tracing Your Way to Better Embedded Software....................................................................70

To Interleave, or Not to Interleave: That is the Question.........................................................74

Using FPGA-Based Hybrid Computers for Bioinformatics Applications ........................................80

II DRR Delivers High-Performance SDR.................................................................................84

Eliminate Packet Buffering Busywork...................................................................................88

REFERENCE..................................................................................................................92

VIEWPOINT

ESL Tools Make FPGAs Nearly Invisible to Designers
You can increase your productivity through ESL methodologies.

by Steve Lass
Sr. Director, Software Marketing
Xilinx, Inc.
[email protected]

FPGA business dynamics have been rather consistent over the past decade. Price per gate has continued to fall annually by an average of 25%, while device densities have continued to climb by an average of 56%. Concurrent with advances in silicon, design methodologies have also continued to evolve.

In particular, a new paradigm known as electronic system level (ESL) design is promising to usher in a new era in FPGA design. Although the term ESL is broad and its definition subject to different interpretations, here at Xilinx it refers to tools and methodologies that raise design abstraction to levels above the current mainstream register transfer level (RTL) language (Figure 1).

Put another way, it is now possible for you to capture designs in high-level languages (HLLs) such as ANSI C and implement them in an optimized manner on FPGA hardware. These advances in software methodologies – when combined with the increasing affordability of FPGA silicon – are helping to make programmable logic the hardware platform of choice for a wide and rapidly expanding set of target applications.

Defining the Problem
You might wonder what problem ESL is trying to solve. Why is it needed? To answer this, let's consider the following three scenarios. First, many of today's design problems originate as a software algorithm in C. The traditional flow to hardware entails a manual conversion of the C source to equivalent HDL. ESL adds value by providing an automatic conversion from HLL to RTL or gates. For those wishing to extract the most performance, you can optionally hand-edit the intermediate RTL.

Second, for both power and performance reasons, it is clear that traditional processor architectures are no longer sufficient to handle the computational complexity of the current and future generation of end applications. ESL is a logical solution that helps overcome the challenges of processor bottlenecks by making an easy and automated path to hardware-based acceleration possible.

Third, as the types of appliances that can now be deployed on FPGAs become more sophisticated, traditional simulation methods are often not fast enough. ESL methodologies enable faster system simulations, utilizing very high speed transaction-based models that allow you to verify functionality and perform hardware/software trade-off analysis much earlier in the design cycle.

Value Proposition for FPGAs
ESL technologies are a unique fit for programmable hardware. Together, ESL tools and FPGAs enable a desktop-based development environment that allows applications to target hardware using standard software-development flows. You can design, debug, and download applications developed using HLLs to an FPGA board much the same way that you can develop, debug, and download code to a CPU board.

Additionally, with powerful 32-bit embedded processors now a common feature in FPGAs, you can implement a complete system – hardware and software – on a single piece of silicon. Not only does this provide a high level of component integration, but with the help of ESL tools, it is now easy and convenient to dynamically partition design algorithms between the appropriate hardware (FPGA fabric) and software (embedded CPU) resources on the chip (Figure 2).

ESL brings a productivity advantage to both current FPGA hardware designers as well as to a potentially large number of software programmers. Hardware engineers are using HLL-based methods for rapid design prototyping and to better manage the complexity of their designs. Software developers, interested in accelerating their CPU bottlenecks, are using ESL flows to export computationally critical functions to programmable hardware.

Figure 1 – Design methodologies are evolving to HLL-based system-level design (from schematics and gates in the 1980s, to Verilog/VHDL RTL in the 1990s, to system-level design in the 2000s).

Figure 2 – ESL facilitates dynamic hardware/software partitioning conveniently possible in FPGAs.

ESL adds value by being able to abstract away the low-level implementation details associated with hardware design. By doing so, the tools simplify hardware design to such an extent as to render the hardware practically "invisible" to designers. This helps expand the reach of FPGAs from the traditional base of hardware designers to a new and potentially larger group of software designers, who until now were exclusively targeting their applications to processors. As there are many more software developers in the world than hardware designers, this equates to a large new opportunity for FPGAs.

If you remain skeptical about the role of HLLs in hardware design, you are not alone. Your concerns might range from compiler quality of results and tool ease-of-use issues to the lack of language standardization and appropriate training on these emerging technologies. The good news is that ESL suppliers are aware of these challenges and are making substantial and sustained investments to advance the state of the art. These efforts are beginning to yield results, as evidenced by an increasing level of interest from end users. For instance, more than two-thirds of the engineers in a Xilinx® survey expressed a keen interest in learning about the capabilities of ESL. Xilinx estimates that hundreds of active design seats exist where ESL methods are being used with FPGAs to solve practical problems.

You may also be wondering if ESL will signal the end of current RTL-based design methodologies. Although there is little doubt that ESL should be regarded as a disruptive technology, ESL flows today are positioned to complement, rather than compete with, HDL methods. In fact, several ESL tools write out RTL descriptions that are synthesized to the gate level using current RTL synthesis. And there will always be scenarios where design blocks may need to be hand-coded in RTL for maximum performance. By plugging into the existing well-developed infrastructure, ESL can help solve those design challenges not addressed by current methodologies.

Next Steps
If you are looking for a place to begin your orientation on ESL and FPGAs, a good place to start would be the ESL knowledge center at www.xilinx.com/esl. In March 2006, Xilinx launched the ESL Initiative – a collaborative alliance with the industry's leading tool providers to promote the value, relevance, and benefits of ESL tools for FPGAs (Table 1). The website aims to empower you with knowledge and information about what is available and how to get started, including information on low-cost evaluation platforms that bundle FPGA boards with ESL software and design examples.

This issue of the Xcell Journal features a range of articles on ESL and FPGAs written by engineers at Xilinx as well as our ESL partners. These articles are designed to give you better insights into emerging ESL concepts. You may find that some of these technologies do indeed present a better approach to solving your design challenges. We hope that some of you will feel inspired to take the next step and become part of the growing number of early adopters of this unique and revolutionary new methodology.

Partner | From HLL to FPGA / Xilinx CPU Support | Description (Key Value)
Bluespec | ✓ | SystemVerilog-Like to FPGA (High QoR)
Binachip | ✓ ✓ | Processor Acceleration (Accelerates Binary Software Code in FPGA)
Celoxica | ✓ ✓ | Handel-C, SystemC to FPGA (High QoR)
Cebatech | ✓ | ANSI C to FPGA (Time to Market)
Critical Blue | ✓ | Co-Processor Synthesis (Accelerates Binary Software Code in FPGA)
Codetronix | ✓ ✓ | Multi-Threaded HLL for Hardware/Software Design (Time to Market)
Impulse | ✓ ✓ | Impulse-C to FPGA (Low Cost)
Mimosys | ✓ ✓ | PowerPC APU-Based Acceleration (Ease of Use)
Mitrion | ✓ | Mitrion-C to FPGA (Supercomputing Applications)
Mirabilis | ✓ | High-Speed Virtual Simulation (Perform Hardware/Software Trade-Off Analysis)
Nallatech | ✓ | Dime-C to FPGA (High-Performance Computing Applications)
Poseidon | ✓ ✓ | MicroBlaze/PowerPC Acceleration (Analysis, Profiling, and C Synthesis)
SystemCrafter | ✓ | SystemC to FPGA (Rapid Prototyping)
Teja | ✓ ✓ | Distributed Processing Using MicroBlaze/PowerPC (High-Speed Packet Processing)

Table 1 – Xilinx partners provide a wide spectrum of system-level design solutions.

VIEWPOINT

Join the "Cool" Club
You're not doing ESL yet? Loser!

by Kevin Morris
Editor – FPGA Journal
Techfocus Media, Inc.
[email protected]

When the Xcell Journal asked me to write a viewpoint on ESL, I thought, "Hey, why not? They've got 'Journal' in their name and I know all about ESL." ESL is the design methodology of the future. ESL will revolutionize the electronics industry. ESL will generate billions of dollars worth of engineering productivity while the people and companies that develop the technology fight tooth and nail to capture infinitesimal percentages of that sum.

ESL (electronic system level [design]) brings with it a pretentious name and the potential for manic marketing misrepresentation. Although there are excellent ESL products and efforts underway, the early market is a good time for the buyer to beware and for savvy design teams to understand that they are likely to be experimenting with cutting-edge, nascent technology. ESL is also currently an over-broad term, encapsulating a number of interesting but diverse tools and technologies.

In order to understand ESL, we need to hack through the typical layers of marketing hype and confusion to get down to the fundamentals. ESL is not about languages or layers of abstraction. ESL is about productivity. Electronic hardware design (this is important because ESL is a hardware designer's term) is far too inefficient. Forget the fact that we have boosted productivity an order of magnitude or two over the past 20 years because of technology advances in electronic design automation. Moore's Law is a much harsher mistress than that. Do the math. In two decades, the gate count of the average digital design platform has risen by a factor of something like 2K. Even if EDA has managed a 200x improvement in gates per digital designer day, they are falling behind at a rate of more than 5x per decade.

This effect was observed ad infinitum in practically every EDA PowerPoint presentation given in the 1990s. Logarithmic graphs of gate counts were shown skyrocketing upward exponentially while designer productivity lounged along linearly, leaving what the EDA marketers proudly promoted as "The Gap." This gap (skillfully highlighted and animated by black-belt marketing ninjas) was their battle cry – their call to arms. Without the help of their reasonably priced power tools, you and your poor, helpless electronic system design company would be gobbled up by The Gap, digested by your competitors who were, of course, immune to Moore's Law because of the protective superpower shield of their benevolent EDA suppliers.

This decade, however, The Gap has gone into public retirement. Not because it isn't still there – it is, growing as fast as ever. This pause is because we found a killer-app use for all of those un-designable gap gates, giving us a temporary, one-time reprieve from the doom of design tool deficiency. Our benefactor? Programmability. It has been widely documented that devices like FPGAs pay a steep price in transistor count for the privilege of programmability. Estimates run as high as 10x for the average area penalty imposed by programmable logic when compared with custom ASIC technologies.


Dedicate 90% of your logic fabric transistors to programmability, throw in a heaping helping of RAM for good measure, and you've chewed up enough superfluous silicon to leave a manageable design problem behind.

By filling several process nodes worth of new transistors with stuff you don't have to engineer, you've stayed on the productivity pace without having to reinvent yourself. Now, unfortunately, the bill is coming due again. FPGAs have become as complex as ASICs were not that long ago, and the delta is diminishing. Design teams around the world are discovering that designing millions of gates worth of actual FPGA logic using conventional RTL methodologies can be a huge task.

In an ASIC, however, this deficiency in design efficiency was masked by the mask. The cost and effort required to generate and verify ASIC mask sets far exceeded the engineering expense of designing an ASIC with RTL methodologies. Boosting the productivity of the digital designer would be a bonus for an ASIC company, but not a game-changing process improvement. Even if digital design time and cost went to zero, cutting-edge ASIC development would still be slow and expensive.

In FPGA design, there is no such problem to hide behind, however. Digital design is paramount on the critical path of most FPGA projects. Are you an FPGA designer? Try this test. Look outside your office door up and down the corridor. Do you see a bunch of offices or cubes filled with engineers using expensive design-for-manufacturing (DFM) software, correcting for optical proximity, talking about "rule decks" and sweating bullets about the prospect of their design going to "tapeout"? Not there, are they? Now look in a mirror. There is your company's critical path.

With ESL the buzzword-du-jour in design automation, everyone in the EDA industry is maneuvering – trying to find a way to claim that what they were already doing is somehow ESL so they can be in the "cool" club. As a result, everything from bubble and block diagram editors to transaction-level simulation to almost-high-level hardware synthesis has been branded "ESL" and rushed to the show floor at the Design Automation Conference, complete with blinking lights, theme music, and special, hush-hush secret sneak-preview demos served with shots of espresso in exclusive private suites.

FPGAs and ESL are a natural marriage. If we believe our premise that ESL technology delivers productivity – translating into a substantial reduction in time to market – we see a 100% alignment with FPGA's key value proposition. People need ESL techniques to design FPGAs for the same reason they turned to FPGAs in the first place – they want their design done now. Second-order values are aligned as well, as both FPGA platforms and ESL tools deliver dramatically increased flexibility. When design changes come along late in your product development cycle, both technologies stand ready to help you respond rapidly.

With all of this marketing and uncertainty, how do you know if what you're doing is actually ESL? You certainly do not want all of the other designers pointing and laughing at engineering recess if you show up on the playground with some lame tool that isn't the real thing. Although there is no easy answer to that question, we can provide some general guidelines. First, if you are describing your design in any language in terms of the structure of the hardware, what you are doing probably isn't ESL. Most folks agree that the path to productivity involves raising the abstraction layer.

If we are upping our level of abstraction, how will the well-heeled hardware engineer be describing his design circa 2010? Not in RTL. Above that level of abstraction, however, a disquieting discontinuity occurs. Design becomes domain-specific. The natural abstraction for describing a digital signal processing or video compression algorithm probably bears no resemblance to the native design tongue of the engineer developing packet-switching algorithms, for example. As a result, we are likely to see two different schools of thought in design description. One group will continue to pursue general-purpose specification techniques, while the other charts a course of domain-specificity.

In the software world, there is ample precedent for both ways of thinking. Even though software applications encompass a huge gamut from business to entertainment to scientific, most software engineers manage to develop their applications using one of the popular general-purpose languages like C/C++ or Java. On the other hand, some specialized application types such as database manipulation and user interface design have popularized the use of languages specific to those tasks. Software is a good exemplar for us, because programmers have been working at the algorithmic level of abstraction for years. Hardware design is only just now heading to that space.

The early ESL tools on the market now address specific hot spots in design. Some provide large-scale simulation of complex, multi-module systems. Others offer domain-specific synthesis to hardware from languages like C/C++ or MATLAB's M, aimed at everything from accelerating software algorithms and offloading embedded processors to creating super-high-performance digital signal processing engines for algorithms that demand more than what conventional Von Neumann processors can currently deliver.

These tools are available from a variety of suppliers, ranging from FPGA vendors themselves to major EDA companies to high-energy startups banking on hitting it big based on the upcoming discontinuity in design methodology. You should be trying them out now. If not, plan on being left behind by more productive competitors over the next five years. At the same time, you should realize that today's ESL technologies are still in the formative stage. These are nascent engineering tools with a long way to go before they fulfill the ultimate promise of ESL – a productivity leap in digital design methodology that will ultimately allow us to keep pace with Moore's Law.

What's New

To complement our flagship publication Xcell Journal, we've recently launched three new technology magazines:

• Embedded Magazine, focusing on the use of embedded processors in Xilinx® programmable logic devices.

• DSP Magazine, focusing on the high-performance capabilities of our FPGA-based reconfigurable DSPs.

• I/O Magazine, focusing on the wide range of serial and parallel connectivity options available in Xilinx devices.

In addition to these new magazines, we've created a family of Solution Guides, designed to provide useful information on a wide range of hot topics such as Broadcast Engineering, Power Management, and Signal Integrity. Others are planned throughout the year.

www.xilinx.com/xcell/

Bringing Imaging to the System Level with PixelStreams
You can develop FPGA-based HDTV and machine vision systems in minutes using Celoxica's suite of ESL tools and the PixelStreams video library.

by Matt Aubury
Vice President, Celoxica
[email protected]

The demand for high-performance imaging systems continues to grow. In broadcast and display applications, the worldwide introduction of HDTV has driven data rates higher and created a need to enhance legacy standard definition (SD) content for new displays. Simultaneously, rapid changes in display technology include novel 3D displays and new types of emissive flat panels.

Concurrently, machine vision techniques that were once confined to manufacturing and biomedical applications are finding uses in new fields. Automotive applications include lane departure warning, drowsy driver detection, and infrared sensor fusion. Security systems currently in development can automatically track people in closed-circuit TV images and analyze their movements for certain patterns of behavior. In homes, video-enabled robots will serve as entertainment and labor-saving devices.

Cost, processing power, and time to market are critical issues for these new applications. With flexible data paths, high-performance arithmetic, and the ability to reconfigure on the fly, FPGAs are increasingly seen as the preferred solution. However, designers of reconfigurable imaging systems face some big hurdles:

• Getting access to a rich library of reusable IP blocks

• Integrating standard IP blocks with customized IP

• Integrating the imaging part of the system with analysis, control, and networking algorithms running on a CPU such as a PowerPC™ or Xilinx® MicroBlaze™ processor

In this article, I'll introduce PixelStreams from Celoxica, an open framework for video processing that has applicability in both broadcast and machine vision applications. PixelStreams leverages the expressive power of the Celoxica DK Design Suite to develop efficient and high-performance FPGA-based solutions in a fraction of the time of traditional methods.

Filters and Streams
A PixelStreams application is built from a network of filters connected together by streams. Filters can generate streams (for example, a sync generator), transform streams (for example, a colorspace converter), or absorb streams (for example, video output).

Streams comprise flow control, data transport, and high-level type information. The data component in turn contains:

• An active flag (indicating that this pixel is in the current region-of-interest)

• Optional pixel data (in 1-bit, 8-bit, or signed 16-bit monochrome, 8-bit YCbCr, or RGB color)

• An optional (x, y) coordinate

• Optional video sync pulses (horizontal and vertical sync, blanking, and field information for interlaced formats)

Combining all of these components into a single entity gives you great flexibility. Filters can perform purely geometric transforms by modifying only the coordinates, or create image overlays just by modifying the pixel data.

Filters use the additional type information that a stream provides in two ways: to ensure that they are compatible with the stream they are given (for example, a sync generator cannot be attached directly to a video output because it does not contain any pixel data), and to automatically parameterize themselves. Filters are polymorphic; a single PxsConvert() filter handles colorspace conversion between each of the pixel data formats (which includes 20 different operations).

Flow Control
Flow control is handled by a downstream "valid" flag (indicating the validity of the data component on a given clock cycle) and an upstream "halt" signal. This combination makes it easy to assemble multi-rate designs, with pixel rates varying (dynamically if necessary) from zero up to the clock rate. Simple filters are still easy to design. For example, a filter that modifies only the pixel components would use a utility function to copy the remaining components (valid, active, coordinates, and sync pulses) from its inputs to outputs while passing the halt signal upstream. More complex filters that need to block typically require buffering at their inputs, which is easily handled by using the generic PxsFIFO() filter.
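Conceptually, the per-clock contents of a stream and the valid/halt handshake just described can be pictured with the following C declarations. These are invented purely for illustration; they are not the actual PixelStreams or DK types, which are Handel-C constructs.

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical mental model of what a stream carries on one clock cycle. */
    typedef struct {
        bool     valid;     /* downstream: the fields below are meaningful this cycle */
        bool     active;    /* this pixel lies inside the current region-of-interest  */
        uint8_t  r, g, b;   /* optional pixel data (one of several supported formats) */
        uint16_t x, y;      /* optional coordinate                                    */
        bool     hsync, vsync, blank, field;  /* optional sync/blanking information   */
    } pixel_beat;

    /* The matching upstream signal: a consumer asserts halt to stall its producer. */
    typedef struct {
        bool halt;
    } pixel_backpressure;

A pass-through filter copies a pixel_beat from input to output (perhaps altering only the pixel or coordinate fields) and forwards halt in the opposite direction, which is the utility-function behavior described above.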

Sophisticated designs will require filter networks in multiple clock domains. These are easy to implement using the DK Design Suite, which builds efficient and reliable channels between clock domains, together with send and receive filters that convert streams to channel communications. The first domain simply declares the channel (in this case, by building a short FIFO to maximize throughput) and uses the PxsSend() filter to transmit the stream:

// domain 1
chan X with { fifolength = 16 };
...
PxsSend (&S0, &X);

The second domain imports the channel using extern and receives the stream from it with a PxsReceive() filter:

// domain 2
extern chan X;
...
PxsReceive (&S1, &X);

Custom Filters
PixelStreams comes "batteries included," with more than 140 filters, many of which can perform multiple functions. A sample of this range is shown in Table 1. Every PixelStreams filter is provided with complete source code. Creating custom filters is often just a matter of modifying one of the standard filters. New filters are accessible from either the graphical editor or from code.

• Image analysis and connected component ("blob") labeling

• Image arithmetic and blending

• Clipping and region-of-interest

• Colorspace conversion and dithering

• Coordinate transforms, scaling, and warping

• Flow control, synchronization, multiplexing, FIFOs, and cross-clock domain transport

• Convolutions, edge detection, and non-linear filtering

• Frame buffering and de-interlacing

• Color look-up-tables (LUTs)

• Grayscale morphology

• Noise generators

• Video overlays, text and cursor generators, line plotter

• Sync generators for VGA, TV, HDTV

• Video I/O

Table 1 – An overview of the provided source-level IP

Concept to FPGA Hardware in 10 Minutes
Let's take a simple HDTV application as an example. We have an SD video stream that we want to de-interlace, upscale, apply gamma correction to, then dither and output to an LCD.

We can assemble the blocks for this very quickly using the PixelStreams graphical editor. The TV (SD) input undergoes colorspace conversion from YCbCr to RGB and is stored in a frame buffer that implements simple bob de-interlacing. Pixels are fetched from the frame buffer, with the upscaling achieved by modifying the lookup coordinates. The output stream is gamma-corrected using a color LUT, dithered, and output. Assembling this network takes only a few minutes. The editor uses heuristics to set the stream parameters automatically. The final design is shown in Figure 1.

One click generates code for this network, with a corresponding project and workspace, and launches the DK Design Suite. The generated code is shown in Figure 2. For this simple application each filter is a single process running in a parallel "par" block. More sophisticated designs will have C-based sequential code running in parallel to the filters.

One more click starts the build process for our target platform (in this case, Celoxica's Virtex™-4 based RC340 imaging board). The build process automatically runs ISE™ software place and route and generates a bit file, taking about five minutes. One last click and the design is downloaded to the board over USB, as shown in Figure 3.

As designs become more complex, it becomes easier to code them directly rather than use the graphical editor. This enables a higher level of design abstraction, with the ability to construct reusable hierarchies of filters and add application control logic.

Simulation
Simulation is handled by the DK Design Suite's built-in cycle-based simulator and utilizes its virtual platform technology to show video input and output directly on screen. Celoxica's co-simulation manager can simulate designs incorporating CPUs and external RTL components.

#define ClockRate 65000000
#include "pxs.hch"

void main (void)
{
    /* Streams */
    PXS_I_S  (Stream0, PXS_RGB_U8);
    PXS_I_S  (Stream1, PXS_YCbCr_U8);
    PXS_PV_S (Stream2, PXS_EMPTY);
    PXS_PV_A (Stream3, PXS_EMPTY);
    PXS_PV_A (Stream4, PXS_RGB_U8);
    PXS_PV_S (Stream6, PXS_RGB_U8);
    PXS_PV_A (Stream7, PXS_RGB_U8);
    PXS_PV_S (Stream8, PXS_RGB_U8);

    /* Filters */
    par
    {
        PxsTVIn (&Stream1, 0, 0, ClockRate);
        PxsConvert (&Stream1, &Stream0);
        PxsPalPL2RAMFrameBufferDB (&Stream0, &Stream3, &Stream4, Width, PXS_BOB,
                                   PalPL2RAMCT(0), PalPL2RAMCT(1), ClockRate);
        PxsVGASyncGen (&Stream2, Mode);
        PxsScale (&Stream2, &Stream3, 704, 576, 1024, 768);
        PxsStaticLUT (&Stream4, &Stream7, PxsLUT8Square);
        PxsOrderedDither (&Stream8, &Stream6, 5);
        PxsVGAOut (&Stream6, 0, 0, ClockRate);
        PxsRegenerateCoord (&Stream7, &Stream8);
    }
}

Figure 2 – Generated source code for example application

Figure 1 – Example application built using the PixelStreams graphical editor

Synthesis
PixelStreams capitalizes on the strengths of the DK Design Suite's built-in synthesis to achieve high-quality results on Xilinx FPGAs. For example:

• When building long pipelines, it is common that certain components of streams will not be modified. These will be efficiently packed into SRL16 components.

• When doing convolutions or coordinate transforms, MULT18 or DSP48 components will be automatically targeted (or if the coefficients are constant, efficient constant multipliers are built).

• Line buffers and FIFOs will use either distributed or pipelined block RAM resources, which are automatically tiled.

• The data components of a stream are always registered, to maximize the timing isolation between adjacent filters.

• DK's register retiming can move logic between the pipeline stages of a filter to maximize clock rate (or minimize area).

In addition, DK can generate VHDL or Verilog for use with third-party synthesis tools such as XST or Synplicity and simulation tools such as Aldec's Active HDL or Mentor Graphics's ModelSim.

System Integration
A complete imaging system typically comprises:

• A PixelStreams filter network, with additional C-based control logic

• A MicroBlaze or PowerPC processor managing control, user interface, or high-level image interpretation functions

• VHDL or Verilog modules (possibly as the top level)

To deal with CPU integration, PixelStreams provides a simple "PxsBus" that can be readily bridged to CPU buses such as AMBA or OPB or controlled remotely (over PCI or USB). This is purely a control bus, allowing filters to be controlled by a CPU (for adding menus or changing filter coefficients) or to provide real-time data back from a filter to the controlling application (such as the result of blob analysis).

To support integration with RTL flows, PixelStreams offers a PxsExport() filter that packages a filter network into a module that can be instantiated from VHDL or Verilog. Alternatively, an RTL module can be instantiated within a PixelStreams top level using PxsImport(). Used together, pre-synthesized filter networks can be rapidly instantiated and reduce synthesis time.

Conclusion
Combining the two main elements of ESL tools for Xilinx FPGAs – C- and model-based design – PixelStreams offers a uniquely powerful framework for implementing a variety of imaging applications. The provision of a wide range of standard filters combined with features for integration into RTL flows and hardware/software co-design makes it easy to add imaging features to your system-level designs.

Future versions of PixelStreams will extend both its flexibility as a framework and the richness of the base library of IP. We plan to add additional pixel formats (such as Porter-Duff based RGBA) and color depths and increase the available performance by introducing streams that transfer multiple pixels per clock cycle. We also intend to introduce the ability to easily transmit streams over other transports, such as USB, Ethernet, and high-speed serial links, add more filters for machine vision features such as corner detection and tracking, and offer additional features for dynamically generating user interfaces.

You can use PixelStreams to target any prototyping or production board identified for a project with the addition of simple I/O filters. It can immediately target Celoxica's range of RC series platforms and is fully compatible with the ESL Starter Kit for Xilinx FPGAs. For more information, visit www.celoxica.com/xilinx, www.celoxica.com/pxs, and www.celoxica.com/rc340.

Figure 3 – Application running in hardware

Evaluating Hardware Acceleration Strategies Using C-to-Hardware Tools
Software-based methods enable iterative design and optimization for performance-critical applications.

by David Pellerin
CTO, Impulse Accelerated Technologies, Inc.
[email protected]

Scott Thibault, Ph.D.
President, Green Mountain Computing Systems, Inc.
[email protected]

Language-based ESL tools for FPGAs have proven themselves viable alternatives to traditional hardware design methods, bringing FPGAs within the reach of software application developers. By using software-to-hardware compilation, software developers now have greater access to FPGAs as computational resources.

What has often been overlooked by traditional hardware designers, however, is the increased potential for design exploration, iterative performance optimization, and higher performance when ESL tools are used in combination with traditional FPGA design methods.

In this article, we'll describe the role of C-to-hardware ESL tools for iterative design exploration and interactive optimization. We'll present techniques for evaluating alternative implementations of C-language accelerators in Xilinx® FPGAs and explore the relative performance of fixed- and floating-point FPGA algorithms.

Crossing the Abstraction Gap
Before ESL tools for FPGAs existed, it was necessary to describe all aspects of an FPGA-based application using relatively low-level methods such as VHDL, Verilog, or even schematics. These design methods are still adequate, and in many cases preferable, for traditional FPGA-based hardware applications. However, when using traditional hardware design methods for creating complex control or computationally intense applications, a significant gap in abstraction (Figure 1) can exist between the original software algorithm and its corresponding synthesizable hardware implementation, as expressed in VHDL or Verilog. Crossing this gap may require days or weeks of tedious design conversion, making iterative design methods difficult or impossible to manage.

Tools providing C compilation and optimization for FPGAs can help software and hardware developers cross this gap by providing behavioral algorithm-level methods of design. Even for the most experienced FPGA designers, the potential exists for design improvements using such tools.

Although it may seem counterintuitive to an experienced hardware engineer, using higher level tools can actually result in higher performance applications because of the dramatically increased potential for design experimentation and rapid prototyping.

Iterative Optimization Is the Key
To understand how higher level tools can actually result in higher performance applications, let's review the role of software compilers for more traditional non-FPGA processors.

Modern software compilers (for C, C++, Java, and other languages) perform much more than simple language-to-instruction conversions. Modern processors, computing platforms, and operating systems have a diverse set of architectural characteristics, but today's compilers are built in such a way that you can (to a great extent) ignore these many architectural features. Advanced optimizing compilers take advantage of low-level processor and platform features, resulting in faster, more efficient applications.

Nonetheless, for the highest possible performance, you must still make programming decisions based on a general understanding of the target. Will threading an application help or hurt performance?


Should various types of data be stored on disk, maintained in heap memory, or accessed from a local array? Is there a library function available to perform a certain I/O task, or should you write a custom driver? These questions, and others like them, are a standard and expected part of the application development process.

For embedded, real-time, and DSP application developers there are even more decisions to be made – and more dramatic performance penalties for ignoring the realities of the target hardware. For these platforms, the ability to quickly experiment with new algorithmic approaches is an important enabler. It is the primary reason that C programming has become an important part of every embedded and DSP programmer's knowledge base. Although most embedded and DSP programmers understand that higher performance is theoretically possible using assembly language, few programmers wish to use such a low level of abstraction when designing complex applications.

What does all of this mean for FPGA programmers? For software engineers considering FPGAs, it means that some understanding of the features and constraints of FPGA platforms is critical to success; it is not yet possible to simply ignore the nature of the target, nor is it practical to consider legacy application porting to be a simple recompilation of existing source code. For hardware engineers, it means that software-to-hardware tools should be viewed as complements to – not replacements for – traditional hardware development methods. For both software and hardware designers, the use of higher level tools presents more opportunities for increasing performance through experimentation and fast prototyping.

Practically speaking, the initial results of software-to-hardware compilation from C language descriptions are not likely to equal the performance of hand-coded VHDL, but the turnaround time to get those first results working may be an order of magnitude better. Performance improvements can then occur iteratively, through an analysis of how the application is being compiled to the hardware and through the experimentation that C-language programming allows.

Graphical tools like those shown in Figure 2 can help provide initial estimates of algorithm performance such as loop latencies and pipeline throughput. Using such tools, you can interactively change optimization options or iteratively modify and recompile C code to obtain higher performance. Such design iterations may take only a matter of minutes when using C, whereas the same iterations may require hours or even days when using VHDL or Verilog.

Analyzing Algorithms
To illustrate how design iteration and C-language programming can help when prototyping algorithms, consider a DSP function such as a fast Fourier transform (FFT) or an image-processing function such as a filter that must accept sample data on its inputs and generate the resulting filtered values on its outputs. By using C-to-hardware tools, we can easily try a variety of different implementation strategies, including the use of different pipeline depths, data widths, clock rates, and numeric formats, to find a combination of size (measured as FPGA slice count) and speed (measured both as clock speed and data throughput) that meets the specific requirements of a larger hardware/software system.

Figure 1 – Crossing the abstraction gap: in system-level design, hardware development can be the bottleneck when creating a working prototype. (The flow runs from system modeling, typically using C, through design partitioning and hardware/software hand-off, to software development with embedded software tools in parallel with hardware development with HDL tools, followed by design refinement; milestones include a high-level specification, frozen interfaces, software complete, prototype complete, and product complete.)

Figure 2 – A dataflow graph allows C programmers to analyze the generated hardware and perform explorative optimizations to balance trade-offs between size and speed. Illustrated in this graph is the final stage of a six-stage pipelined loop. This graph also helps C programmers understand how sequential C statements are parallelized and optimized.

To demonstrate these ideas, we conducted a series of tests with a 32-bit implementation of the FFT algorithm. The FFT we chose includes a 32-bit FIFO input, a 32-bit FIFO output, and two clocks, allowing the FFT to be clocked at a different rate than the embedded processor with which it communicates. The algorithm itself is described using relatively straightforward, hardware-independent C code. This implementation of the FFT uses a main loop that performs the required butterfly operations on the incoming data to generate the resulting filtered output.

Because this FFT algorithm is implemented as a single inner loop representing a radix-4 butterfly, we can use the automatic pipelining capabilities of the Impulse C compiler to try a variety of pipeline strategies. Pipelining can introduce a high degree of parallelism in the generated logic, allowing us to achieve higher throughput at the expense of additional hardware. This pipeline itself may be implemented in a variety of ways, depending on how fast we wish to clock the FFT.
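As a rough illustration of the kind of loop being described – a generic sketch, not the authors' actual source – the following plain, hardware-independent C shows one fixed-point butterfly pass (radix-2 here for brevity; the design discussed above uses a radix-4 butterfly). The helper fix_mul is invented for this example, and in an Impulse C flow the inner loop would additionally be marked for pipelining with the tool's pipeline pragma (syntax not shown).

    #include <stdint.h>

    /* Q1.15 fixed-point multiply (hypothetical helper; scaling/saturation omitted). */
    static int16_t fix_mul(int16_t a, int16_t b)
    {
        return (int16_t)(((int32_t)a * (int32_t)b) >> 15);
    }

    /* One butterfly pass over n complex samples (re/im in Q1.15).
       w_re/w_im hold the twiddle factors used by this stage.
       The inner loop is the part a C-to-hardware compiler would pipeline. */
    void butterfly_pass(int16_t *re, int16_t *im,
                        const int16_t *w_re, const int16_t *w_im,
                        int n, int half)
    {
        for (int i = 0; i < n; i += 2 * half) {
            for (int j = 0; j < half; j++) {
                int a = i + j, b = i + j + half;
                int16_t tr = fix_mul(re[b], w_re[j]) - fix_mul(im[b], w_im[j]);
                int16_t ti = fix_mul(re[b], w_im[j]) + fix_mul(im[b], w_re[j]);
                re[b] = re[a] - tr;  im[b] = im[a] - ti;
                re[a] = re[a] + tr;  im[a] = im[a] + ti;
            }
        }
    }

Because each iteration's loads, multiplies, and adds do not depend on the previous iteration's results, the compiler can overlap successive iterations in a pipeline, which is the effect visible in the pipelined rows of Table 1.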

Another aspect to this algorithm (and others like it) is that it can be implemented using either fixed- or floating-point math. For a high-end processor such as an Intel Pentium or AMD Opteron, the choice of floating point is obvious. But for an FPGA, floating point may or may not make sense given the lack of dedicated floating-point units and the corresponding expense (in terms of FPGA slice count) of generating additional hardware.

Fortunately, the Impulse C tools allow the use of either fixed- or floating-point operations, and Xilinx tools are capable of instantiating floating-point FPGA cores from the generated logic. This makes performance comparisons relatively easy. In addition, the Impulse tools allow pipelining for loops to be selectively enabled, and pipeline size and depth to be indirectly controlled through a delay constraint.

Table 1 shows the results of two different optimization and pipelining strategies for the 32-bit FFT, for both the fixed- and floating-point versions. We generated results using the Impulse C version 2.10 tools in combination with Xilinx ISE™ software version 8.1, and targeting a Xilinx Virtex™-4 LX-25 device. Many more choices are actually possible during iterative optimization, depending on the goals of the algorithm and the clocking and size constraints of the overall system. In this case, there is a clear benefit from enabling pipelining in the generated logic. There is also, as expected, a significant area penalty for making use of floating-point operations and a significant difference in performance when pipelining is enabled.

Table 2 shows a similar test performed using a 5x5 kernel image filter. This filter, which has been described using two parallel C-language processes, demonstrates how a variety of different pipeline optimization strategies (again specified using a delay constraint provided by Impulse C) can quickly evaluate size and speed trade-offs for a complex algorithm. For all cases shown, the image filter data rate (the rate at which pixels are processed) is exactly one half the clock rate.

For software applications targeting FPGAs, the ability to exploit parallelism (through instruction scheduling, pipelining, unrolling, and other automated or manual techniques) is critical to achieving quantifiable improvements over traditional processors.
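As a minimal, generic illustration of exposing parallelism at the source level (not code from the article's FFT or image filter), manually unrolling an accumulation loop hands a C-to-hardware compiler four independent multiply-accumulate chains that it can schedule concurrently or pipeline:

    /* Sum of products over n samples, unrolled by four; the four partial sums
       carry no dependence on one another, so their operations can proceed in
       parallel in the generated hardware. */
    long dot_unrolled4(const short *a, const short *b, int n)
    {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i;
        for (i = 0; i + 3 < n; i += 4) {
            s0 += (long)a[i]     * b[i];
            s1 += (long)a[i + 1] * b[i + 1];
            s2 += (long)a[i + 2] * b[i + 2];
            s3 += (long)a[i + 3] * b[i + 3];
        }
        for (; i < n; i++)          /* any leftover samples */
            s0 += (long)a[i] * b[i];
        return s0 + s1 + s2 + s3;
    }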

But parallelism is not the whole story; data movement can actually become a more significant bottleneck when using FPGAs. For this reason, you must balance the acceleration of critical computations and inner-code loops against the expense of moving data between hardware and software.
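One back-of-the-envelope way to reason about that balance – a generic model for illustration, not output from any particular tool – is to fold the transfer cost into an Amdahl-style estimate of overall speedup:

    /* Effective speedup when a fraction f of a run of length total (seconds)
       is offloaded to hardware that is s times faster, at a cost of xfer
       seconds of hardware/software data movement. */
    double effective_speedup(double total, double f, double s, double xfer)
    {
        double accelerated = total * (1.0 - f)   /* work left on the CPU     */
                           + (total * f) / s     /* the accelerated kernel   */
                           + xfer;               /* data movement overhead   */
        return total / accelerated;
    }

For example, a 10-second run with 80% of its time accelerable by 20x but with 1.5 seconds of transfers yields 10 / (2 + 0.4 + 1.5), or about 2.6x, rather than the roughly 4.2x that the compute acceleration alone would suggest; the data movement dominates.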

Fortunately, modern tools for FPGA compilation include various types of analysis features that can help you more clearly understand and respond to these issues. In addition, the ability to rapidly prototype alternative algorithms – perhaps using very different approaches to data management such as data streaming, message passing, or shared memory – can help you more quickly converge on a practical implementation.

Conclusion
With tools providing C-to-hardware compilation and optimization for Xilinx FPGAs, software and hardware developers can cross the abstraction gap and create faster, more efficient prototypes and end products. To discover what C-to-hardware technologies can do for you, visit www.xilinx.com/esl or www.impulsec.com/xilinx.

FFT Data Type | Pipelining Enabled? | Slices Used | FFs Used | LUTs Used | Clock Frequency | Total FFT Cycles
Fixed Point | No | 3,207 | 2,118 | 5,954 | 80 MHz | 1,536
Fixed Point | Yes | 2,810 | 2,347 | 5,091 | 81 MHz | 261
Floating Point | No | 10,917 | 7,298 | 20,153 | 88 MHz | 10,496
Floating Point | Yes | 10,866 | 7,610 | 19,855 | 74 MHz | 269

Stage Delay Constraint | Slices Used | FFs Used | LUTs Used | Clock Frequency | Pipeline Stages
300 | 1,360 | 1,331 | 2,186 | 60 MHz | 5
200 | 1,377 | 1,602 | 2,209 | 85 MHz | 7
150 | 1,579 | 2,049 | 2,246 | 99 MHz | 9
100 | 1,795 | 2,470 | 2,392 | 118 MHz | 11
75 | 2,305 | 3,334 | 2,469 | 139 MHz | 15

Table 1 – 32-bit FFT optimization results for fixed- and floating-point versions, with and without pipelining enabled

Table 2 – 16-bit image filter optimization results for various pipelining and stage delay combinations

Page 19: Xcell - Xilinx

by Deepak ShankarPresident and CEOMirabilis Design [email protected]

Performance analysis and early architectureexploration ensures that you will select theright FPGA platform and achieve optimalpartitioning of the application onto thefabric and software. This early explorationis referred to as rapid visual prototyping.Mirabilis Design’s VisualSim software sim-ulates the FPGA and board using modelsthat are developed quickly using pre-built,parameterized modeling libraries in agraphical environment.

These library models resemble the ele-ments available on Xilinx® FPGAs, includ-ing PowerPC™, MicroBlaze™, andPicoBlaze™ processors; CoreConnect;DMA; interrupt controllers; DDR; blockRAM; LUTs; DSP48E; logic operators;and fabric devices. The components areconnected to describe a given Xilinx Virtexplatform and simulated for different oper-ating conditions such as traffic, user activi-ty, and operating environment.

More than 200 standard analysis out-puts include latency, utilization, through-put, hit ratio, state activity, context

switching, power consumption, andprocessor stalls. VisualSim acceleratesarchitecture exploration by reducing typ-ical model development time frommonths to days.

I can illustrate the advantages of early architecture exploration with an example from one of our customers, who was experiencing difficulty with a streaming media processor implemented using a Virtex™-4 device. The design could not achieve the required performance and was dropping every third frame. Utilization at all of the individual devices was below 50%. A visual simulation that combined both the peripheral and the FPGA identified that the video frames were being transferred at the same clock sync as the audio frames along a shared internal bus.

As the project was in the final stages of development, making architecture changes to address the problem would have delayed shipment by an additional six months. Further refinement of the VisualSim model found that by giving the audio frames a higher priority, the design could achieve the desired performance, as the audio frames would also be available for processing. The project schedule was delayed by approximately 1.5 months.

If the architecture had been modeled early in the design cycle, the design cycle could have been reduced by 3 months, eliminating the 1.5-month re-spin and getting to market approximately 5 months sooner. Moreover, with a utilization of 50%, control processing could have been moved to the same FPGA. This modification might have saved one external processor, a DDR controller, and one memory board.

Rapid Visual Prototyping
Rapid visual prototyping can help you make better partitioning decisions. Evaluations with performance and architectural models can help eliminate clearly inferior choices, point out major problem areas, and evaluate hardware/software trade-offs. Simulation is cheaper and faster than building hardware prototypes and can also help with software development, debugging, testing, documentation, and maintenance. Furthermore, early partnership with customers using visual prototypes improves feedback on design decisions, reducing time to market and increasing the likelihood of product success (Figure 1).

A design-level specification captures a new or incremental approach to improve system throughput, power, latency, utilization, and cost; these improvements are typically referred to as price/power/performance trade-offs.




At each step in the evolution of a design specification, well-intentioned modifications or improvements may significantly alter the system requirements. The time required to evaluate a design modification before or after the system design process has started can vary dramatically, and a visual prototype will reduce evaluation time.

To illustrate the use of the rapid visual prototype, let's consider a Layer 3 switch implemented using a Virtex FPGA. The Layer 3 switch is a non-blocking switch, and the primary consideration is to maintain total utilization across the switch.

Current Situation
In product design, three factors are certain: specifications change, non-deterministic traffic creates performance uncertainty, and Xilinx FPGAs get faster. Products operate in environments where the processing and resource consumption are a function of the incoming data and user operations. FPGA-based systems used for production must meet quality, reliability, and performance metrics to address customer requirements. What is the optimal distribution of tasks into hardware acceleration and software on FPGAs and other board devices? How can you determine the best FPGA platform to meet your product requirements and attain the highest performance at the lowest cost?

Early Exploration Solution
VisualSim provides pre-built components that are graphically instantiated to describe hardware and software architectures. The applications and use cases are described as flow charts and simulated on the VisualSim model of the architecture using multiple traffic profiles. This approach reduces the model construction burden and allows you to focus on analysis and interpretation of results. It also helps you optimize product architectures by running simulations with application profiles to explore FPGA selection; hardware versus software decisions; peripheral devices versus performance; and partitioning of behavior on target architectures.

Design Optimization
You can use architecture exploration (Figure 2) to optimize every aspect of an FPGA specification, including:

• Task distribution on MicroBlaze and PowerPC processors

• Sizing the processors

• Selecting functions requiring a co-processor

• Determining optimal interface speeds and pins required

• Exploring block RAM allocation schemes, cache and RAM speeds, off-chip buffering, and impact of redundant operators

An analysis conducted using VisualSim includes packet size versus latency, protocol overhead versus effective bandwidth, and resource utilization.

In reference to the Layer 3 example, your decisions would include using:

• The on-chip PowerPC or external processor for routing operations

• Encryption algorithms using the DSP function blocks or fabric multipliers and adders

• A dedicated MicroBlaze processor for traffic management or fabric

• PowerPC for control or proxy rules processing

• TCP offload using an external co-processor or MicroBlaze processor

Can a set of parallel PicoBlaze processors with external SDRAM support in-line spyware detection? What will the performance be when the packet size changes from 256 bytes to 1,512 bytes? How can you plan for future applications such as mobile IP?

You can extend the exploration to consider the interfaces between the FPGA and board peripherals, such as SDRAM.

Figure 1 – Translating a system concept into rapid visual prototyping (Step 1: block diagram; Step 2: system model; Step 3: model choices, including the questions the model must answer, the abstraction of the blocks, data structures, and topology; Step 4: add block details; Step 5: run simulation, exploring parameters and topologies and examining utilization, latency, and throughput until the model answers the design questions)

Figure 2 – Architecture model of the FPGA platform and peripheral using VisualSim




As the PowerPC will be sharing the bus with the MicroBlaze processor, the effective bus throughput is a function of the number of data requests and the size of the local block RAM buffers. For example, you could enhance the MicroBlaze processor with a co-processor to do encryption at the bit level in the data path. You could also use the CoreConnect bus to connect the peripheral SDRAM to the PowerPC while the DDR2 is used for the MicroBlaze processor.

You can reuse the VisualSim architecture model for exploration of the software design, identifying high-resource-consumption threads, balancing load across multiple MicroBlaze processors, and splitting operations into smaller threads. If a new software task or thread has data-dependent priorities, exploration of the priorities and task-arrival time on the overall processing is a primary modeling question. If you change the priority on a critical task, will this be sufficient to improve throughput and reduce task latency? In most cases, this will be true, but there may be a relative time aspect to a critical task that can reduce latencies on lower priority tasks such that both benefit from the new ordering. If peak processing is above 80% for a system processing element, then the system may be vulnerable to last-minute task additions, or to future growth of the system itself.

Model Construction
System modeling of the Layer 3 switch (Figure 3) starts by compiling the list of functions (independent of implementation), expected processing time, resource consumption, and system performance metrics. The next step is to capture a flow diagram in VisualSim using a graphical block diagram editor (Figure 3). The flow diagrams are UML diagrams annotated with timing information. The functions in the flow are represented as delays; timed queues represent contention; and algorithms handle the data movement. The flow diagram comprises data processing, control, and any dependencies.

Data flow includes flow and traffic management, encryption, compression, routing, proxy rules, and TCP protocol handling. The control path contains the controller algorithm, branch decision trees, and weighted polling policies. VisualSim builds scenarios to simulate the model and generate statistics. The scenarios are multiple concurrent data flows such as connection establishment (slow path); in-line data transfer after setup of a secure channel (fast path); and protocol- and data-specific operation sequences based on data type identification.

You can use this model of the timed flow diagram for functional correctness and validation of the flows. VisualSim uses random traffic sequences to trigger the model. The traffic sequences are defined data structures in VisualSim; a traffic generator emulates application-specific traffic. This timed flow diagram selects the FPGA platform and conducts the initial hardware and software partitioning. The flow diagram model defines the FPGA components and peripheral hardware using the FPGA Modeling Toolkit.

The functions of the flow diagram are mapped to these architecture components. For each function, VisualSim automatically collects the end-to-end delay and number of packets processed in a time period. For the architecture, VisualSim plots the average processing time, utilization, and effective throughput (Figure 4). These metrics are matched against the requirements. Exploration of the mapping and architecture is possible by varying the link and replacing the selected FPGA with other FPGAs.
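As a rough illustration of what these metrics mean, the toy C++ sketch below computes end-to-end latency, utilization, and effective throughput from a list of simulated packet records; it is a generic stand-alone example with made-up numbers, not VisualSim output or API.

#include <cstdio>
#include <vector>

struct PacketRecord {
    double arrival;   // time the packet entered the system
    double start;     // time processing began
    double finish;    // time processing completed
};

int main()
{
    // Hypothetical results from a short simulation run (times in microseconds).
    std::vector<PacketRecord> log = {
        {0.0, 0.0, 1.2}, {0.5, 1.2, 2.6}, {1.0, 2.6, 3.7}, {4.0, 4.0, 5.1}
    };

    double busy = 0.0, latency = 0.0;
    for (const PacketRecord &p : log) {
        busy    += p.finish - p.start;      // time the resource was occupied
        latency += p.finish - p.arrival;    // end-to-end delay per packet
    }
    const double span = log.back().finish - log.front().arrival;

    std::printf("average latency      = %.2f us\n", latency / log.size());
    std::printf("utilization          = %.1f %%\n", 100.0 * busy / span);
    std::printf("effective throughput = %.2f packets/us\n", log.size() / span);
    return 0;
}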

The outcome of this effort will be selecting the right FPGA family, correctly sizing the peripherals, and choosing the right number of block RAMs, DSP blocks, and MicroBlaze processors. You can add overhead to the models to capture growth requirements and ensure adequate performance.

Conclusion
Early architecture exploration ensures a highly optimized product for quality, reliability, performance, and cost. It provides direction for implementation plans, reduces the amount of testing you need to conduct, and can shrink the development cycle by almost 30%.

VisualSim libraries of standard FPGA components, flow charts defining the behavior, traffic models, and pre-built analysis probes ensure that system design is no longer time consuming, difficult to perform, or prone to questionable results. The reduction in system modeling time and the availability of standard component models provide a single environment for designers to explore both hardware and software architectures.

For a free 21-day trial of the FPGA Modeling Toolkit, including the MicroBlaze and PowerPC models, register at www.mirabilisdesign.com/webpages/evaluation/mdi_evaluation.htm. To learn more about VisualSim, visit www.mirabilisdesign.com, where there are models embedded in the HTML pages. You can modify parameters and execute from within your web browser without having to download custom software.

Figure 4 – Analysis output for the Layer 3 switch design

Figure 3 – Flow chart describing the application flow diagram in VisualSim


Rapid Development of Control Systems with Floating Point
Generate high-quality synthesizable HDL with the Mobius compiler.

by Per Ljung
President
Codetronix
[email protected]

With the advent of ever-larger FPGAs, both software algorithm developers and hardware designers need to be more productive. Without an increase in abstraction level, today's complex software systems cannot be realized, both because of a lack of resources (time, money, developers) and an inability to manage the complexity. Electronic system level (ESL) design offers a solution where an abstraction level higher than HDL will result in higher productivity, higher performance, and higher quality.

In this article, I'll show you how to use the high-level multi-threaded language Mobius to quickly and efficiently implement embedded control systems on FPGAs. In particular, I'll demonstrate the implementation of a floating-point embedded control system for a Xilinx® Virtex™-II FPGA in a few lines of Mobius source code.




Mobius
Mobius is a tiny high-level multi-threaded language and compiler designed for the rapid development of hardware/software embedded systems. A higher abstraction level and ease of use lets you achieve greater design productivity compared to traditional approaches. At the same time you are not compromising on quality of results, since benchmarks show that Mobius-generated circuits match the best hand designs in terms of throughput and resources for both compact and single-cycle pipelined implementations.

Developing embedded systems with Mobius is much more like software development using a high-level language than hardware development using assembly-like HDL. Mobius has a fast transaction-level simulator, so the code/test/debugging cycle is much faster than traditional HDL iterations. The Mobius compiler translates all test benches into HDL, so for a quick verification simply compare the Mobius simulation with the HDL simulation using the same test vectors. The compiler-generated HDL is assembled from a set of Lego-like primitives connected using handshaking channels. As a result of handshaking, the circuit is robust and correct by construction.

Mobius lets software engineers create efficient and robust hardware/software systems, while hardware engineers become much more productive. With less than 200 lines of Mobius source, users have implemented FIR filters, FFT transforms, JPEG encoding, and DES and AES encryption, as well as hardware/software systems with PowerPC™ and MicroBlaze™ processors.

Parallelism is the key to obtaining high performance on FPGAs. Mobius enables compact sequential circuits, latency-intolerant inelastic pipelined circuits, and parallel latency-tolerant elastic pipelined circuits. Using the keywords "seq" and "par," the Mobius language allows basic blocks to run in sequence or in parallel, respectively. Parallel threads communicate with message-passing channels, where the keywords "?" and "!" are used to read and write over a channel that waits until both the reader and writer are ready before proceeding. Message passing obviates the need for low-level locks, mutexes, and semaphores. The Mobius compiler also identifies common parallel programming errors such as illegal parallel read/write or write/write statements.

The Mobius compiler generates synthesizable HDL using Xilinx-recommended structures, letting the Xilinx synthesizer efficiently infer circuits. For instance, variables are mapped to flip-flops and arrays are mapped to block RAM. Because Mobius is a front end to the Xilinx design flow, Mobius supports all current Xilinx targets, including the PowerPC and DSP units found on the Virtex-II Pro, Virtex-4, and Virtex-5 device families. The generated HDL is readable and graphically documented, showing hierarchical control and dataflow relationships. Application notes show how to create hardware/software systems communicating over the on-chip peripheral and Fast Simplex Link buses.

Handshaking allows you to ignore low-level timing, as handshaking determines the control and dataflow. However, it is very easy to understand the timing model of Mobius-generated HDL. Every Mobius signal has a request, acknowledge, and data component, where the req/ack bits are used for handshaking. For a scalar variable instantiated as a flip-flop, a read takes zero cycles and a scalar write takes one cycle. An assignment statement takes as many cycles as the sum of its right-hand-side (RHS) and left-hand-side (LHS) expressions. An assignment with scalar RHS and LHS expressions therefore takes one cycle to execute. Channel communications take an unknown number of cycles (they wait until both reader and writer are ready). A sequential block of statements takes as many cycles as the sum of its children, and a parallel block of statements takes the maximum number of cycles of its children.

Mobius has native support for parameterized signed and unsigned integers, fixed point and floating point. The low-latency math libraries are supplied as Mobius source code. The fixed-point add, sub, and mul operators are zero-cycle combinatorial functions. The floating-point add and sub operators take four cycles, the mul operators take two cycles, and the div operator is iterative, with a latency dependent on the size of the operands.

Infinite Impulse Response Filter
As an example of an embedded control system, let's look at how you can use Mobius to quickly implement an infinite impulse response (IIR) filter. I will investigate several different architectures for throughput, resources, and quantization using parameterized fixed- and floating-point math.

The IIR is commonly found in both control systems and signal processing. A discrete-time proportional-integral-derivative (PID) controller and lead-lag controller can be expressed as an IIR filter. In addition, many common signal processing filters such as Elliptic, Butterworth, Chebychev, and Bessel are implemented as IIR filters. Numerically, an IIR comprises an impulse transfer function H(z) with q poles and p zeros:

H(z) = \frac{\sum_{i=0}^{P} b_i z^{-i}}{1 - \sum_{k=1}^{Q} a_k z^{-k}}

This can also be expressed as a difference equation:

y(n) = \sum_{i=0}^{P} b_i x(n-i) + \sum_{k=1}^{Q} a_k y(n-k)

The IIR has several possible implementations. The direct form I is often used by fixed-point IIR filters because a larger single adder can prevent saturation. The direct form II is often used by floating-point IIR filters because it uses fewer states and the adders are not as sensitive to saturation. The cascade canonical form has the lowest quantization sensitivity, but at the cost of additional resources.




For example, if p = q and p is even, then the direct form I and II implementations only require 2p + 1 constant multipliers, while the cascade requires 5p/2 constant multipliers.

Because of the many trade-offs, it is clear that numerical experiments are necessary to determine a suitable implementation.

IIR Implementation in Mobius
Let's select a third-order IIR filter where b0 = 0, b1 = 0, b2 = 1, b3 = 0.5, a1 = -1, a2 = 0.01, and a3 = 0.12. The difference equation can be easily written as the multi-threaded Mobius program shown in Figure 1.

The Mobius source defines the IIR filter as a procedure and a test bench to exercise it. Inside iir(), a perpetual loop continuously reads a new reference value u, updates the states x1, x2, x3, and then calculates and writes the output. The IIR uses four (fixed- or floating-point) adders and three (fixed- or floating-point) multipliers. Note how the values of all states are simultaneously updated. The test bench simply sends a constant reference value to the filter input, reads the filter output, and writes the resulting step value to stdout. By separating the IIR filter from the test bench, just the IIR filter can be synthesized.
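For readers who want to sanity-check the filter behavior without the Mobius toolchain, the short C++ model below implements the same state update and output equations in double precision and prints the step response; it is an illustrative stand-alone sketch, not code generated by Mobius, and the 36-sample count simply matches the length of the console trace in Figure 1.

#include <cstdio>

int main()
{
    // Coefficients from the article's third-order IIR example.
    const double a2 = 0.01, a3 = 0.12, b3 = 0.5;
    double x1 = 0.0, x2 = 0.0, x3 = 0.0;

    for (int n = 0; n < 36; n++) {
        const double u = 1.0;                       // constant step input
        // Simultaneous state update, as in the Mobius listing.
        const double nx1 = x2;
        const double nx2 = x3;
        const double nx3 = u - (a3 * x1 + a2 * x2 - x3);
        x1 = nx1; x2 = nx2; x3 = nx3;
        std::printf("y=%f\n", b3 * x1 + x2);        // filter output
    }
    return 0;
}

The first few outputs (0, 1, 2.5, 3.99, ...) track the fixed-point trace in Figure 1, with the small later differences coming from sfixed(6,8) quantization.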

Fixed Point or Floating Point?
Floating-point math allows for large dynamic range but typically requires considerable resources. Using fixed point instead can result in a much smaller and faster design, but with trade-offs in stability, precision, and range.

Both fixed- and floating-point math are fully integrated into the Mobius language, making it easy for you to mix and match various-sized fixed- and floating-point operators. In the Mobius source, the type "t" defines a parameterized fixed- or floating-point size. By changing this single definition, the compiler will automatically use the selected parameterization of operands and math operators in the entire application. For instance, the signed fixed-point type definition sfixed(6,8) uses 14 bits, where 6 bits describe the whole number and 8 bits the fraction. The floating-point type definition float(6,8) uses 15 bits, with 1 sign bit, 6 exponent bits, and 8 mantissa bits.
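To see what the sfixed(6,8) format does to the example's coefficients, the small C++ sketch below quantizes each value to an 8-bit fraction (a scale factor of 256) and prints the value actually represented; it is a generic illustration of fixed-point rounding, not Mobius library code.

#include <cmath>
#include <cstdio>

int main()
{
    const double coeffs[] = {0.01, 0.12, 0.5};      // a2, a3, b3 from the example
    const double scale = 256.0;                     // 8 fractional bits in sfixed(6,8)

    for (double c : coeffs) {
        const long raw = std::lround(c * scale);    // nearest representable integer
        std::printf("%.4f -> raw %ld -> %.6f\n", c, raw, raw / scale);
    }
    return 0;
}

With only 8 fraction bits, 0.01 is stored as 0.011719 and 0.12 as 0.121094 (0.5 is exact), which is exactly the kind of quantization effect the numerical experiments in Table 1 are meant to expose.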


procedure testbench();
  type t = sfixed(6,8);   (* type t = float(8,24); *)
  var c1,c2: chan of t;
  var y: t;

  procedure iir(in cu: chan of t; out cy: chan of t);
    const a2: t = 0.01;
    const a3: t = 0.12;
    const b3: t = 0.5;
    var u,x1,x2,x3: t;
    seq
      x1:=0.0; x2:=0.0; x3:=0.0;
      while true do
        seq
          cu ? u;                                          (* read input *)
          x1,x2,x3 := x2,x3, u-(t(a3*x1)+t(a2*x2)-x3);     (* calc states *)
          cy ! t(b3*x1)+x2;                                (* write output *)
        end
    end;

  par (* testbench *)
    iir(c1,c2);
    while true do seq c1 ! 1.0; c2 ? y; write(" y=",y) end (* get step response of filter *)
  end;

Step response printed by the test bench:

y=0.000000  y=1.000000  y=2.500000  y=4.000000  y=5.492188  y=6.855469
y=8.031250  y=9.023438  y=9.832031  y=10.472656  y=10.964844  y=11.335938
y=11.609375  y=11.800781  y=11.933594  y=12.019531  y=12.078125  y=12.109375
y=12.121094  y=12.121094  y=12.117188  y=12.109375  y=12.097656  y=12.085938
y=12.074219  y=12.062500  y=12.054688  y=12.046875  y=12.042969  y=12.035156
y=12.031250  y=12.027344  y=12.027344  y=12.027344  y=12.027344  y=12.027344

Figure 1 – Mobius source code and test bench for IIR filter

Figure 2 – HDL simulation of step response running test bench



Parallel or Sequential Architecture?
Using parallel execution of multiple threads, an FPGA can achieve tremendous speedups over a sequential implementation. Using a pipelined architecture requires more resources, while a sequential implementation can share resources using time-multiplexing.

The iir() procedure computes all expressions in a maximally parallel manner and does not utilize any resource sharing. You can also create alternate architectures that use pipelined operators and statements for higher speed, or use resource sharing for smaller resource usage and slower performance.

Mobius-Generated VHDL/Verilog
Using the Mobius compiler to generate HDL from the Mobius source (Figure 1), ModelSim (from Mentor Graphics) can simulate the step response (Figure 2).

I synthesized several Mobius implementations of the IIR filter for a variety of math types and parameterizations (Table 1) using Xilinx ISE™ software version 7.1i (service pack 4) targeting a Virtex-II Pro XC2VP7 FPGA. The Xilinx ISE synthesizer efficiently maps the HDL behavioral structures onto combinatorial logic and uses dedicated hardware (for example, multipliers) where appropriate. A constraint file was not used. As you can see, the fixed-point implementations are considerably smaller and faster than the floating-point implementations.

An alternate architecture using pipelining reduces the cycle time to 1 cycle for the fixed-point and 18 cycles for the floating-point implementations, with similar resources and clock speeds.

You can also use an alternate architecture with time-multiplexed resource sharing to make the design smaller (but slower). Sharing multipliers and adders in the larger floating-point IIR design results in a design needing only 1,000 slices and 4 multipliers at 60 MHz, but the IIR now has a latency of 48 cycles. This is a resource reduction of 3x and a throughput reduction of 2x.

Conclusion
In this article, I've shown how Mobius users can achieve greater design productivity, as exemplified by the rapid development of several fixed- and floating-point IIR filters. I've also shown how you can quickly design, simulate, and synthesize several architectures using compact, pipelined, and time-multiplexed resource sharing to quantitatively investigate resources, throughput, and quantization effects.

The Mobius math libraries for fixed- and floating-point math enable you to quickly implement complex control and signal-processing algorithms. The Mobius source for the basic IIR filter is about 25 lines of code, whereas the generated HDL is about 3,000-8,000 lines, depending on whether fixed point or floating point is generated. The increase in productivity using Mobius instead of hand-coded HDL is significant.

Using Mobius allows you to rapidly develop high-quality solutions. For more information about using Mobius in your next design, visit www.codetronix.com.

Table 1 – Synthesis results of iir()

          | sfixed(6,8)   | sfixed(8,24)   | float(6,8)    | float(8,24)
Resources | 66 slices     | 287 slices     | 798 slices    | 2,770 slices
          | 66 FFs        | 135 FFs        | 625 FFs       | 1,345 FFs
          | 109 LUTs      | 517 LUTs       | 1,329 LUTs    | 4,958 LUTs
          | 38 IOBs       | 74 IOBs        | 40 IOBs       | 76 IOBs
          | 3 mult18x18s  | 12 mult18x18s  | 3 mult18x18s  | 12 mult18x18s
Clock     | 89 MHz        | 56 MHz         | 103 MHz       | 61 MHz
Cycles    | 2             | 2              | 26            | 26


Hardware/Software Partitioning from Application Binaries
A unique approach to hardware/software co-design empowers embedded application developers.

by Susheel Chandra, Ph.D.
President/CEO
Binachip, Inc.
[email protected]

Computationally intense real-time applications such as Voice over IP, video over IP, 3G and 4G wireless communications, MP3 players, and JPEG and MPEG encoding/decoding require an integrated hardware/software platform for optimal performance. Parts of the application run in software on a general-purpose processor, while other portions must run on application-specific hardware to meet performance requirements. This methodology, commonly known as hardware/software partitioning or co-design, has caused an increasing number of software applications to be migrated to system-on-a-chip (SOC) platforms.

FPGAs have emerged as the SOC platform of choice, particularly in the fast-paced world of embedded computing. FPGAs enable rapid, cost-effective product development cycles in an environment where target markets are constantly shifting and standards continuously evolving. Several families of platform FPGAs are available from Xilinx. Most of these offer processing capabilities, a programmable fabric, memory, peripheral devices, and connectivity to bring data into and out of the FPGA. They provide embedded application developers with a basic hardware platform on which to build end applications, with minimal amounts of time and resources spent on hardware considerations.




Source-to-Hardware Tools
Several tools on the market allow you to translate high-level source descriptions in C/C++ or MATLAB into VHDL or Verilog. Most of these tools fall into the category of "behavioral synthesis" tools and are designed for hardware designers creating a hardware implementation for a complete algorithm. These tools require you to have some intrinsic knowledge of hardware design and to use only a restricted subset of the source language conducive to hardware translation. In some cases they also require you to learn new language constructs. This creates a significant learning curve, which further pushes out time to market.

More recently, some companies have introduced tools for embedded application developers. These tools promote a co-design methodology and allow you to perform hardware/software partitioning. However, they suffer from some of the same limitations of behavioral synthesis tools – you need to learn hardware-specific constructs for the tool to generate optimized hardware. This imposes a fundamental barrier for a tool targeted at embedded application developers, who in most cases are software engineers with no or limited knowledge of hardware design.

Another basic requirement for tools intended for application developers is that they support applications developed in multiple source languages. An embedded application of even moderate complexity will comprise parts written in C/C++ (calling on a suite of library functions) and other parts written in assembly language (to optimize performance). DSP applications add another dimension of complexity by including MATLAB source code in addition to C/C++ and assembly language. Clearly, any tool that does not support multiple source languages will fall short.

A new, unique approach to hardware/software co-design is to start with the binary of the embedded application. It not only overcomes all of the previously mentioned issues but also provides other significant benefits. A new commercial tool called BINACHIP-FPGA from Binachip, Inc. successfully uses this approach.

Partitioning from Binaries
The executable code or binary compiled for a specific processor platform is the least common denominator representation of an embedded application written in multiple source languages. Combined with the instruction set architecture of the processor, it provides a basis for extremely accurate hardware/software partitioning at a very fine grain level. The close-to-hardware nature of binaries also allows for better hardware/software partitioning decisions, as illustrated in the following examples.

Bottlenecks Inside Library Functions
One obvious consideration for hardware/software partitioning is computational complexity. Moving computationally intense functions into hardware can result in a significant performance improvement. Any embedded application typically relies on a suite of library functions for basic mathematical operations such as floating-point multiply/divide, sine, and cosine. Some of these functions are fairly compute-heavy and could be called millions of times during the normal execution of an application, resulting in a bottleneck.

Moving one or more of these functions into hardware could easily provide a 10x-50x speedup in execution time, depending on their complexity and how often they are called. Any co-design tool operating at the source-code level will completely miss this performance improvement opportunity.

Architecture-Independent Compiler Optimizations
Most C/C++ or other source language compilers are very efficient at architecture-independent optimizations such as constant folding, constant propagation, dead code elimination, and common sub-expression elimination. For example, consider this simple snippet of pseudo code:

int a = 30;
int b = 9 - a / 5;
int c;

c = b * 4;
if (c > 10) {
    c = c - 10;
}
return c * (60 / a);

Applying the compiler optimization techniques discussed earlier (a/5 folds to 6, so b = 3 and c = 12; the branch is taken, leaving c = 2; and 60/a folds to 2), this pseudo code can be reduced to:

return 4;

A co-design tool operating at the binary level can leverage these compiler optimizations to further improve the accuracy of hardware/software partitioning. Tools that work at the source level must either perform these optimizations internally or leave some area/performance on the table, resulting in a sub-optimal co-design.

Legacy Applications
For most popular processors such as PowerPC™, ARM, and TI DSP, a host of applications are already out in the field. These applications would benefit from a port to a faster FPGA platform with a co-design implementation. In many of these situations the source code is either not available or is written in an obsolete language. If the source code is available, the original developers may no longer be associated with the project or company; therefore the legacy knowledge does not exist. In such scenarios a tool that works from the binary can prove to be an invaluable asset.

Designing with BINACHIP-FPGA




The BINACHIP-FPGA tool is designed for embedded application developers and allows them to leverage all of the features of the FPGA platform, without having any intrinsic knowledge of hardware design or needing to learn a new source language. The transformation performed by the tool in going from a pure software implementation to a co-design is conceptually illustrated in Figure 1.

An embedded application written in a combination of C/C++, MATLAB, and Java is compiled into a binary executable for a target platform. The bottleneck is identified based on profiling data or user input (see the orange box in Figure 1). The tool automatically generates hardware for these functions, which is mapped into the programmable fabric of the FPGA. All hardware/software communication interfaces are also generated by the tool. Calls to these functions are now computed in hardware, and control is returned to the software when the computation is complete. The tool then outputs a new executable that runs seamlessly on the target platform.

BINACHIP-FPGA is also capable of performing advanced optimizations such as loop unrolling, which allows you to make area/performance trade-offs. You can also exploit fine-grain parallelism by using advanced data scheduling techniques like as-soon-as-possible (ASAP) or as-late-as-possible (ALAP) scheduling.
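As a generic illustration of the area/performance trade-off behind loop unrolling (not BINACHIP-FPGA output), the sketch below shows a simple accumulation loop and a version unrolled by a factor of four; the unrolled form exposes four independent multiply-accumulates that hardware can execute in parallel, at the cost of replicated operators. The array length and unroll factor are arbitrary assumptions.

#include <cstdint>

// Rolled version: one multiply-accumulate per iteration.
int32_t dot_rolled(const int16_t *a, const int16_t *b, int n)
{
    int32_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

// Unrolled by four: the four products have no dependence on each other,
// so a hardware implementation can compute them concurrently.
int32_t dot_unrolled4(const int16_t *a, const int16_t *b, int n)
{
    int32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i + 3 < n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return s0 + s1 + s2 + s3;   // assumes n is a multiple of 4 for brevity
}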

Empowering Application Developers
BINACHIP-FPGA empowers embedded application developers to create advanced applications on FPGA platforms with minimal intervention or interaction with hardware designers. Figure 2 shows a typical design flow for the tool.

After the system architect has made a decision on which platform to use, the application developers start writing software and go through the usual task of compiling, debugging, and profiling the code. BINACHIP-FPGA fits into the flow once the application developers have debugged the basic functionality. After feeding the compiled application and profiler data into the tool, you have the option to manually guide the tool and select which functions should be mapped into the hardware. The tool generates the hardware and reconstructed binary with appropriate calls to the hardware.

You can then simulate the resulting implementation in a hardware/software co-design tool using the bit-true test benches generated by BINACHIP-FPGA. The RTL portion is mapped into the FPGA fabric using Xilinx Synthesis Technology (XST) and the ISE™ design environment. The application is now ready to run on the target platform.

Conclusion
Hardware/software co-design tools promise to empower embedded application developers and enable the large-scale deployment of platform FPGAs. To deliver on that promise, these tools must assume that users will have minimal or no hardware design knowledge. You should not be required to learn new language constructs to use the tools effectively.

BINACHIP-FPGA is the first tool in the market that uses a unique approach to enable application developers to fully leverage the FPGA platform and meet their price, performance, and time-to-market constraints.

For more information on hardware/software co-design or BINACHIP-FPGA, visit www.binachip.com.


(Figure 1 diagram: the original binary of the software program is partitioned; part is recompiled as software for the processor and part is compiled to a hardware implementation on the FPGA, connected through SW/HW and HW/SW interfaces.)

(Figure 2 diagram: a multimedia application in C/C++, Java, and MATLAB is compiled by automated software compilers into a binary, partitioned by BINACHIP into compiled software and RTL VHDL/Verilog, and then implemented with automated CAD tools for FPGA design on an FPGA platform containing a processor, memory, I/O, and FPGA/ASIC hardware.)

Figure 1 – Transforming a pure software implementation into a co-design

Figure 2 – Design flow using BINACHIP-FPGA


The Benefits of FPGA Coprocessing
The Xilinx ESL Initiative brings the horsepower of an FPGA coprocessor closer to the reach of traditional DSP system designers.

by Tom Hill
DSP Marketing Manager
Xilinx, Inc.
[email protected]

High-performance DSP platforms, traditionally based on general-purpose DSP processors running algorithms developed in C, have been migrating toward the use of an FPGA pre-processor or coprocessor. Doing so can provide significant performance, power, and cost advantages (Figure 1).

Even with these considerable advantages, design teams accustomed to working on processor-based systems may avoid using FPGAs because they lack the hardware skills necessary to use one as a coprocessor. Unfamiliarity with traditional hardware design methodologies such as VHDL and Verilog limits or prevents the use of an FPGA, oftentimes resulting in more expensive and power-hungry designs. A new group of emerging design tools called ESL (electronic system level) promises to address this methodology issue, allowing processor-based developers to accelerate their designs with programmable logic while maintaining a common design methodology for hardware and software.




Boosting Performance with FPGA Coprocessing
You can realize significant improvements in the performance of a DSP system by taking advantage of the flexibility of the FPGA fabric for operations benefiting from parallelism. Common examples include (but are not limited to) FIR filtering, FFTs, digital down conversion, and forward error correction (FEC) blocks.

Xilinx® Virtex™-4 and Virtex-5 architectures provide as many as 512 parallel multipliers capable of running in excess of 500 MHz to provide a peak DSP performance of 256 GMACs. By offloading operations that require high-speed parallel processing onto the FPGA and leaving operations that require high-speed serial processing on the DSP, the performance and cost of the DSP system are optimized while lowering system power requirements.

Lowering Cost with FPGA Embedded Processing
A DSP hardware system that includes an FPGA coprocessor offers numerous implementation options for the operations contained within the C algorithm, such as partitioning the algorithm between the DSP processor, the FPGA-configurable logic blocks (CLBs), and the FPGA embedded processor. The Virtex-4 device offers two types of embedded processors: the MicroBlaze™ soft-core processor, often used for system control, and the higher performance PowerPC™ hard-core embedded processor. Parallel operations partitioned into the FPGA fabric can be used directly in a DSP datapath or configured as a hardware accelerator to one of these embedded processors.

The challenge facing designers is how to partition DSP system operations into the available hardware resources in the most efficient and cost-effective manner. How best to use FPGA embedded processors is not always obvious, but this hardware resource can have the greatest impact on lowering overall system cost. FPGA embedded processors provide an opportunity to consolidate all non-critical operations into software running on the embedded processors, minimizing the total amount of hardware resources required for the system.

C to Gates
When targeting an FPGA, the term "C to gates" refers to a C-synthesis design flow that creates one of two implementation options – direct implementation onto the FPGA fabric as a DSP module or the creation of a hardware accelerator for use with the MicroBlaze or PowerPC 405 embedded processor (Figure 2).

When an operation lies directly in the DSP datapath, the highest performance is achieved by implementing the operation as a DSP module. This involves synthesizing the C code directly into RTL and then instantiating the block into the DSP datapath. You can perform this instantiation using traditional HDL design methodologies or through system-level design tools such as Xilinx System Generator for DSP. Through direct instantiation, you can achieve the highest performance with minimal overhead.

The leading C-synthesis tools are capable of delivering performance approaching that of hand-coded RTL – but achieving this requires detailed knowledge of C-synthesis tool operation and coding styles. Code modifications and the addition of in-line synthesis instructions for inserting parallelism and pipeline stages are typically required to achieve the desired performance.
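The fragment below gives a flavor of the kind of source-level hints such tools rely on; the pragma names are purely illustrative placeholders (shown as comments), not directives from any particular C-synthesis product, and the filter length is an assumption.

#include <cstdint>

#define TAPS 16

// A simple FIR-style loop. A C-synthesis tool typically needs hints like the
// (hypothetical) pragmas below to know that the loop should be pipelined and
// the accumulation unrolled across parallel multipliers.
int32_t fir(const int16_t coeff[TAPS], const int16_t sample[TAPS])
{
    int32_t acc = 0;
    // #pragma EXAMPLE_PIPELINE          (illustrative placeholder only)
    // #pragma EXAMPLE_UNROLL factor=4   (illustrative placeholder only)
    for (int i = 0; i < TAPS; i++)
        acc += coeff[i] * sample[i];
    return acc;
}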

(Figure 1 diagram: a TI DSP processor paired with a Virtex-4 FPGA, shown both as a pre-processor and as a coprocessor, with the FPGA handling the high-rate 500-MHz data paths and the DSP handling the lower-rate paths. Figure 2 diagram: a C program targeting either a DSP module in the fabric, a hardware accelerator attached to a MicroBlaze or PPC405 embedded processor, or the DSP processor interface.)

Figure 1 – DSP hardware platform

Figure 2 – C implementation options for a DSP hardware system




Even with these modifications, however, the productivity gains can be significant. The C-system model remains the golden source driving the design flow.

An alternative and often simpler approach is to create a hardware accelerator for one of the Xilinx embedded processors. The processor remains the primary target for the C routines, with the exception that performance-critical operations are pushed to the FPGA logic in the form of a hardware accelerator. This provides a more software-centric design methodology. However, some performance is sacrificed with this approach. C routines are synthesized to RTL, similar to the DSP module approach, except that the top-level entity is wrapped with interface logic to allow it to connect to one of the Xilinx embedded processor buses. This creates a hardware accelerator that can be imported into the Xilinx EDK environment and called through a software-friendly C function call.
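Conceptually, the software side of such an accelerator ends up looking like an ordinary function call. The names below (fir_accel_init, fir_accel_run) and the base address are hypothetical placeholders used only to illustrate the idea, with trivial software stand-ins so the sketch is self-contained; they are not functions generated by the Xilinx EDK flow.

#include <cstdint>

// Hypothetical driver functions that a generated wrapper might expose,
// stubbed here in software purely for illustration.
static void fir_accel_init(uint32_t base_address) { (void)base_address; }
static void fir_accel_run(const int16_t *in, int16_t *out, int n)
{
    for (int i = 0; i < n; i++) out[i] = in[i];   // placeholder for the HW result
}

void process_block(const int16_t *samples, int16_t *filtered, int n)
{
    static bool ready = false;
    if (!ready) {
        fir_accel_init(0x80000000u);   // assumed base address of the peripheral
        ready = true;
    }
    // From the C programmer's point of view this is just another function call;
    // the data transfer and hand-off to the FPGA logic happen inside the wrapper.
    fir_accel_run(samples, filtered, n);
}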

The performance requirements for mapping C routines into hardware accelerators are typically less aggressive. Here the objective is to accelerate the performance beyond that of pure software while maintaining a software-friendly design flow. Although the coding techniques and in-line synthesis instructions are still available, you can typically achieve your desired performance gains without their use.

Design Methodology – The Barrier to Adoption
The effort and breadth of skill required to correctly partition and implement a complex DSP system is formidable. In 2005, the market research firm Forward Concepts conducted a survey to determine the most important FPGA selection criteria for DSP. The published results, shown in Figure 3, identify development tools as the most important.

The survey illustrates that the benefits of a DSP hardware system utilizing an FPGA coprocessor are well understood, but that the current state of development tools represents a significant barrier to adoption for traditional DSP designers.

The Xilinx ESL Initiative
ESL design tools are pushing digital design abstraction beyond RTL. A subset of these tool vendors are specifically focused on mapping system models developed in C/C++ into DSP hardware systems that include FPGAs and DSP processors. Their vision is to make the hardware platform transparent to the software developer (Figure 4).

Rather than attempting to solve one piece of this methodology puzzle internally, this year Xilinx launched a partnership program with key providers of ESL tools called the ESL Initiative. The focus of this partnership is to empower designers with software programming skills to be able to easily implement their ideas in programmable hardware without having to learn traditional hardware design skills. This program is designed to accelerate the development and adoption of world-class design methodologies through innovation within the ESL community.

Conclusion
When combined, the collective offerings from Xilinx ESL partners offer a wide spectrum of complementary solutions that are optimized for a range of applications, platforms, and end users. Xilinx too has focused its efforts on complementary technology. For example, AccelDSP Synthesis provides a hardware path for algorithms developed in floating-point MATLAB, while System Generator for DSP allows modules developed using ESL designs to be easily combined with Xilinx IP and embedded processors. The quickest path to realizing a programmer-friendly FPGA design flow is through a motivated and innovative set of partners.

For more information about the ESL Initiative, visit www.xilinx.com/esl.

(Figure 3 chart: DSP chip selection criteria ranked by normalized weighted responses – development tools, power consumption, on-chip memory, competitive price, development boards, I/O, optimal MIPS/MMACS, vendor roadmap, reference designs available, available algorithms, code-compatible upgrade, vendor support, third-party support, available hardware accelerators, packaging, and maximum MIPS/MMACS.)

(Figure 4 diagram: a product specification maps onto the FPGA platform – programmable logic, memory, DSP, and embedded CPU – through three design and verification paths above RTL: Xilinx ESL (MATLAB modeling and AccelDSP M-to-gates), mainstream RTL (HDL simulation and HDL synthesis), and partner ESL (high-level language modeling and synthesis). AccelDSP enables MATLAB to gates and complements the broader C-based ESL flows, which span system modeling, module generation, and algorithm generation. HLL = high-level language.)

Figure 3 – 2005 Forward Concepts market survey, “DSP Strategies: The Embedded Trend Continues”

Figure 4 – Xilinx ESL Initiative design flows


Using SystemC and SystemCrafter to Implement Data Encryption
SystemCrafter provides a simple route for high-level FPGA design.

by Jonathan Saul, [email protected]

C and C++ have been a popular starting point for developing hardware and systems for many years; the languages are widely known, quick to write, and give an executable specification, which allows for very fast simulation. C or C++ versions of standard algorithms are widely available, so you can easily reuse legacy and publicly available code. For system-level design, C and C++ allow you to capture hardware and software descriptions in a single framework.

Two drawbacks exist, however. First, C and C++ do not support the description of some important hardware concepts, such as timing and concurrency. This has led to the development of proprietary C-like languages, which aren't popular because they tie users to a single software supplier. Second, you must manually translate C and C++ to a hardware description language such as VHDL or Verilog for hardware implementation. This time-consuming step requires hardware experts and often introduces errors that are difficult to find.

The first problem has been solved by the development of SystemC, which is now a widely accepted industry standard that adds hardware concepts to C++.

The second problem has been solved by the development of tools like SystemCrafter SC, which allows SystemC descriptions to be automatically translated to VHDL.

In this article, I'll explain how to use SystemCrafter SC and SystemC by describing an implementation of the popular DES encryption algorithm.

SystemC
SystemC provides an industry-standard means of modeling and verifying hardware and systems using standard software compilers. You can download all of the material required to simulate SystemC using a standard C++ compiler, such as Microsoft Visual C++ or GNU GCC, from the SystemC website free of charge (www.systemc.org).

SystemC comprises a set of class libraries for C++ that describe hardware constructs and concepts. This means that you can develop cycle-accurate models of hardware, software, and interfaces for simulation and debugging within your existing C++ development environment.
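For readers who haven't seen SystemC, the minimal module below illustrates the flavor of those class libraries: an 8-bit counter described as a clocked process using the standard SC_MODULE, sc_in/sc_out, and SC_METHOD constructs. It is a generic introductory example, not part of the DES design described later.

#include <systemc.h>

// An 8-bit counter with synchronous reset, written as a clocked SystemC process.
SC_MODULE(Counter)
{
    sc_in<bool>          clk;     // clock input
    sc_in<bool>          reset;   // synchronous reset
    sc_out<sc_uint<8> >  count;   // current counter value

    sc_uint<8> value;

    void tick()
    {
        if (reset.read())
            value = 0;
        else
            value = value + 1;
        count.write(value);
    }

    SC_CTOR(Counter) : value(0)
    {
        SC_METHOD(tick);
        sensitive << clk.pos();   // evaluate on every rising clock edge
        dont_initialize();
    }
};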

SystemC allows you to perform the initial design, debugging, and refinement using the same test benches, which eliminates translation errors and allows for fast, easy verification.

And because SystemC uses standard C++, the productivity benefits offered to software engineers for years are now available to hardware and system designers. SystemC is more compact than VHDL or Verilog; as a result, it is faster to write and more maintainable and readable. It can be compiled quickly into an executable specification.

SystemCrafter SC
SystemC was originally developed as a system modeling and verification language, yet it still requires manual translation to a hardware description language to produce hardware.

SystemCrafter SC automates this process by quickly synthesizing SystemC to RTL VHDL. It will also generate a SystemC description of the synthesized circuit, allowing you to verify the synthesized code with your existing test harness.

You can use SystemCrafter SC for:

• Synthesizing SystemC to hardware

• System-level design and co-design

• Custom FPGA co-processing and hardware acceleration

SystemCrafter SC gives you control of the critical steps of scheduling (clock cycle allocation) and allocation (hardware reuse). Thus, the results are always predictable, controllable, and will match your expectations.




SystemCrafter SC allows you to develop, refine, debug, and synthesize hardware and systems within your existing C++ compiler's development environment. You can run fast, executable SystemC specifications to verify your design. You can also configure the compiler so that SystemCrafter SC will automatically run when you want to generate hardware.

The DES Algorithm
The Data Encryption Standard (DES) algorithm encodes data using a 64-bit key. This same 64-bit key is required to decode the data at the receiving end. It is a well-proven, highly secure means of transmitting sensitive data. DES is particularly suitable for hardware implementation, as it requires only simple operations such as bit permutation (which is particularly expensive in software), exclusive-OR, and table look-up operations.

A DES implementation comprises two stages. During the first stage, 16 intermediate values are pre-computed based on the initial key. These 16 values are fixed for a particular key value and can be reused for many blocks of data. Calculation of the key values involves repeated shifts and reordering of bits.

The second computation stage involves 16 iterations of a circuit using one of the pre-computed key values. Encryption is based on 64-bit blocks of data, with 64 bits of input data encoded for each group of 16 iterations, resulting in 64 bits of output data. Each iteration involves permutations, exclusive-OR operations, and the look-up of eight 4-bit values in 8 look-up tables.
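The C++ sketch below shows the general shape of one such iteration: split the 64-bit block into halves, expand and mix the right half with the round key, substitute through the look-up tables, permute, and swap. The expansion, S-box, and P permutation functions are identity placeholders standing in for the real DES tables (defined in FIPS 46-3), kept trivial here so the sketch stays self-contained.

#include <cstdint>

// Placeholder stand-ins for the real DES tables; they keep the sketch
// self-contained but do not perform the actual permutations.
static uint64_t expand(uint32_t half)        { return half; }        // 32 -> 48 bit expansion
static uint32_t sbox_lookup(uint64_t bits48) { return (uint32_t)bits48; } // eight table look-ups
static uint32_t permute_p(uint32_t bits32)   { return bits32; }      // P permutation

// One round of the 16-round structure: L/R halves plus a 48-bit round key.
void des_round(uint32_t &left, uint32_t &right, uint64_t round_key)
{
    const uint32_t f = permute_p(sbox_lookup(expand(right) ^ round_key));
    const uint32_t new_right = left ^ f;   // mix the f-function output into the left half
    left  = right;                         // swap halves for the next round
    right = new_right;
}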

Decryption works exactly the same way as the second computation stage, but with the 16 key values from the first stage used in reverse order.

For a full description of the DES algorithm, go to http://csrc.nist.gov/publications/fips/fips46-3/fips46-3.pdf.

Design Flow
The design flow using SystemC and SystemCrafter is shown in Figure 1. An important benefit of this design flow is that you can carry out the development of the initial SystemC description, partitioning, and system- and gate-level simulation all in the same framework.

The designer implementing the DES algorithm used his normal design environment, Microsoft's Visual C++. The target platform was the ZestSC1, an FPGA board containing a Xilinx® Spartan™-3 FPGA and some SRAM, communicating with a host PC through a USB interface.

Hardware/Software Partitioning
The first step is to write an initial system-level description. It may be appropriate to decide on hardware/software partitioning at this point, or you can defer this decision until you have a working system-level description.

In the DES example, the hardware/software partitioning decision was easy and intuitive. The core of the algorithm involves high-bandwidth computation using operators unsuitable for software implementation, and thus will be implemented on the FPGA.

As I stated previously, the designer wrote a hardware description of the DES core in SystemC using Microsoft's Visual C++ design environment. Pseudo code is shown in Figure 2. Calls to SystemCrafter are made as "custom build steps," which allows simulation and synthesis from within the Visual C++ design environment. You can also use the SystemCrafter GUI to manage projects.

Implementing look-up tables as CoreGen modules shows how complex third-party IP blocks are usable in a SystemCrafter design. SystemCrafter treats SystemC modules with missing method definitions as black boxes, allowing you to use IP blocks written in other languages. The designer wrote a SystemC model of the look-up tables for use during simulation.

Simulation and Synthesis
The SystemCrafter flow allows you to simulate the design at a number of levels.

In the DES example, the designer wrote a test harness, which fed standard DES test patterns through the hardware engine and displayed the results on the console. He specified two configurations in Visual C++: system level and gate level.

The system-level configuration was first.


(The flow runs from a system description and C++ test bench, through a SystemC partial implementation and SystemCrafter SC, to gate-level SystemC and VHDL descriptions, VHDL synthesis, a netlist, place and route, and the FPGA implementation, with behavioral, gate-level, VHDL behavioral, and VHDL gate-level verification steps along the way, plus paths to software and HDL implementation.)

Figure 1 – Design flow


It compiles the original SystemC design to produce a simulation model at the behavioral level. This model is an executable program that can produce a fast behavioral-level simulation of the DES engine.

Once the behavioral model produced the desired results, it was time for the gate-level configuration. It automatically calls SystemCrafter synthesis as part of the build process, which produces two descriptions of the synthesized circuit: a SystemC description and a VHDL description.

As part of the build process, the gate-level SystemC description is compiled into a gate-level simulation model, which can produce a fast gate-level simulation of the DES engine. This verifies that the synthesis process is correct.

The designer used Mentor Graphics' ModelSim to simulate the VHDL description.

Implementation
The VHDL produced by SystemCrafter is the core of the implementation.

Support modules supplied with the board helped with the interface between the PC and the ZestSC1 FPGA board. A VHDL module simplified the connection to the USB bus. The designer wrote a small piece of VHDL to connect the 16-bit USB bus to the 64-bit DES input and output ports.

The complete DES project was then compiled using standard Xilinx tools (XST for the VHDL and CORE Generator™ software for the ROMs, followed by place and route) to generate an FPGA configuration file.

The device driver and C library contained in the ZestSC1 support package were integral in developing a simple GUI that configures the board with the FPGA configuration file. It then loads an image, sends it to the ZestSC1 over the USB bus for encryption and then decryption, and displays the resulting images.

Figure 3 shows a sample output from the GUI.

Discussion
The DES application was written by a VHDL expert who was new to SystemC. The complete design, including both hardware and software, went from concept to final version in only three days, including a number of explorations of different design implementations.

The SystemCrafter flow was flexible enough that the designer could use a mixture of SystemC for the algorithmic part of the design, CORE Generator blocks for the look-up tables, and VHDL for low-level interfacing.

The flow allowed him to use his existing Visual C++ design environment for code development, simulation, and synthesis. The SystemC description was more concise than an equivalent VHDL design would have been, and this level of description allowed the development to focus on algorithmic issues.

Conclusion
SystemCrafter SC offers a simple route from system-level design down to Xilinx FPGAs.

It is suitable for programmers, scien-tists, systems engineers, and hardwareengineers. It enables you to view develop-ing hardware as a higher-level activity thanwriting HDL and allows you to focus onthe algorithm rather than on the details ofthe implementation. This can improvetime to market, reduce design risk, andallow you to design complex systems with-out learning HDLs or electronics.

Both SystemCrafter SC and the DESimplementation, including working sourcefiles and a more detailed description, are avail-able as downloads from the SystemCrafterwebsite, www.systemcrafter.com.


loop forever
    if KeySet = 1
        DataInBusy = 1
        K(0) = Permute(KeyIn)
        loop for i = 1 to 16
            K(i) = K(i-1) << shift amount
            K(i) = Permute(K(i-1))
        end loop
        DataInBusy = 0
    else if DataInWE = 1
        DataInBusy = 1
        D(0) = Permute(DataIn)
        loop for i = 1 to 16
            E = Permute(D(i-1))
            A = E xor K(i)
            S = LUT(A)
            P = Permute(S)
            D(i) = Concat(Range(D(i-1), 31, 0), P xor Range(D(i-1), 63, 32))
        end loop
        wait for DataOutBusy = 0
        DataOut = D(16)
        DataOutWE = 1 for one cycle
        DataOutBusy = 0
    end if
end loop

Figure 2 – Pseudo code for the DES implementation

Figure 3 – DES image encryption GUI


by Allan Cantle, President and CEO, Nallatech, [email protected]

The high-performance computing (HPC) markets have taken a serious interest in the potential of FPGA computing to help solve computationally intensive problems while simultaneously saving space and reducing power consumption. However, to date few capable products and tools are tailored to suit HPC industry requirements.

In the last 15 years, the industry has also moved away from customized, massively parallel computing platforms (like those offered by Cray and SGI) to clusters of industry-standard servers. The majority of HPC solutions today are based on this clustered approach.

To serve the HPC market, Nallatech has introduced a family of scalable cluster-optimized FPGA HPC products, allowing you to either upgrade your existing HPC cluster environments or build new clusters with commercial-off-the-shelf FPGA computing technology.


Scalable Cluster-Based FPGA HPC System Solutions

Nallatech introduces an HPC family of FPGA system solutions optimized for clustering, with performance capability rising to multi-teraflops in a single 19-inch rack.


Nallatech has developed a tool flow that also allows HPC users to continue to use a standard C-based development environment to target FPGA technology. Once familiar with this basic tool flow, you can then learn more about FPGA computing, leveraging more advanced features to fully exploit the technology.

A Scalable FPGA HPC Computing Platform

The traditional approach to HPC computing comprises a large array of microprocessors connected together in a symmetrical fashion using the highest bandwidth and lowest latency communications infrastructure possible. This approach provides a highly scalable structure that is relatively easy to manage.

Although easily scalable and manageable, these regular computing structures are rarely the best architecture to solve different algorithmic problems. Algorithms have to be tailored to fit the given computing platform.

With FPGAs, by contrast, you can build a custom computer around any specific algorithmic problem. This approach yields significant benefits in performance, power, and physical size and has been widely adopted in the embedded computing community. However, embedded computers are typically designed with one application in mind – and therefore a custom computer is crafted for each application.

Nallatech's FPGA HPC architectures are designed to merge the best of both worlds from the embedded and HPC computing industries. Our FPGA HPC architecture is symmetrical with regular communications between FPGA processors, but also has an inherent infrastructure that allows you to handcraft custom computing architectures around given algorithmic problems.

Recognizing the popularity of clustering, Nallatech selected and targeted the architecture for IBM BladeCenter computing platforms. With this approach, you can mix and match traditional microprocessor blades and FPGA computing blades in a proportion that suits the amount of code targeted at the FPGA.

Figure 1 shows the four basic types of configurations possible with Nallatech's FPGA HPC computing platforms. The wide scaling capability allows you to scale the FPGA computing resource in a way that is identical to standard cluster scaling. The deep and wide scaling capability of this architecture additionally permits scaling to more complex custom computing algorithms by treating multiple FPGAs as a single, large FPGA.

A 7U IBM BladeCenter chassis with a protruding Nallatech FPGA computing blade is shown in Figure 2. You can mix this FPGA blade with host processor blades in configurations similar to those detailed in Figure 1. Each blade is configurable with between one and seven FPGA processors providing the deep scaling. Several blades installed into one or more BladeCenter chassis provide the wide scaling necessary for more complex HPC applications.

Floating-point implementations of code in the fields of seismic processing and computational fluid dynamics have demonstrated performance improvements more than 10 times that of standard processors, while achieving power savings well in excess of 20 times.

Figure 1 – Four types of processor-to-FPGA cluster configurations: narrow and shallow, narrow and deep, wide and shallow, and wide and deep combinations of host processors and FPGAs.

Figure 2 – IBM BladeCenter chassis with protruding Nallatech FPGA blade




Creating Custom Virtual Processors

Nallatech's approach to FPGA HPC computing provides an extremely flexible architecture that enables you to create exceptionally high-performance computing solutions with a highly efficient use of power and space. However, without the appropriate development tools, using this type of platform would be virtually impossible. As mentioned previously, any physical instantiation of Nallatech's FPGA computing platform can effectively be treated as:

• A number of physical FPGA processors that exist in the computer connected together

• One massive FPGA processor combining multiple FPGAs and creating one custom compute engine with these processor chips

• Thousands of tiny processors, with many processors existing in each FPGA

Whenever a new design commences, it is counterproductive to think of the problem in terms of the physical computing architecture at hand. Instead, consider how you can construct a custom computer to solve the specific compute problem. Effectively, you can develop a virtual processor that is explicitly designed to solve your specific algorithmic problem.

There are many ways to create a virtual processor for FPGAs, including:

• Using standard language compilers to FPGAs, such as C and Fortran, pulling together various computing library functions available for the FPGA

• Using high-level algorithmic tools such as Simulink from The MathWorks together with Xilinx® System Generator

• Designing a dedicated processor from low-level hardware languages such as VHDL

Most HPC software engineers prefer the language-based approach for obvious reasons, so language-based compilers are now maturing and becoming a viable approach to creating virtual processors.

As a high-performance FPGA computing systems provider, Nallatech has taken an approach that tries to support as many compiler tools as possible. Today, you can develop virtual processors using three approaches:

• Nallatech's DIME-C enabling compiler

• Impulse Technologies' Impulse C compiler

• Xilinx System Generator for The MathWorks' Simulink environment

Nallatech's DIME-C compiler is purely based on a subset of ANSI-C, with no requirement for pragmas or other specific actions. This means that you can design and simulate code with the tools with which you are familiar. Once the C code is functioning correctly, you have the freedom to compile directly through to a bitstream and run the code on a chosen FPGA processor. Nallatech tools provide all of the APIs necessary to access the FPGA through simple function calls from the host processor.
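As a rough illustration of the style of code such a flow accepts, the sketch below is an ordinary ANSI-C-style kernel with a plain software test harness. The function and array names are ours, and the Nallatech host API calls that would configure the FPGA and invoke the compiled bitstream are deliberately omitted because they are product-specific.

// Illustrative ANSI-C-style kernel: a simple moving-average filter written as
// a plain loop nest, the sort of code a C-to-FPGA compiler can map to logic.
// Names are hypothetical; the vendor API for launching it on the FPGA is omitted.
#include <stdio.h>

#define N    1024
#define TAPS 4

void moving_average(const float in[N], float out[N])
{
    for (int i = 0; i < N; i++) {
        float acc = 0.0f;
        for (int t = 0; t < TAPS; t++)        // fixed trip count: easy to unroll/pipeline
            acc += (i >= t) ? in[i - t] : 0.0f;
        out[i] = acc / TAPS;
    }
}

int main(void)
{
    static float in[N], out[N];
    for (int i = 0; i < N; i++)
        in[i] = (float)(i % 16);

    moving_average(in, out);                  // the same code runs natively for simulation
    printf("out[15] = %f\n", out[15]);
    return 0;
}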

Maximizing Data Throughput and Integrating Virtual Processors

In many cases, creating a virtual FPGA processor using a language-based software environment and implementing the resulting executable on the allocated FPGA coprocessor can substantially improve the performance of the chosen application.

However, this approach does not necessarily yield the optimal implementation. You can often benefit from splitting your algorithmic problem into several independent processes. In these circumstances it is often worthwhile to create a virtual processor for each process, as this allows you to decouple one processor from the other and gives you more architectural implementation flexibility.

DIMEtalk allows you to effectively place many virtual processors across the FPGA computing resources in an HPC platform while minimizing the communications overhead between processing elements. DIMEtalk also allows you to make the most of the four levels of memory hierarchy that Nallatech's FPGA HPC platforms provide.

Figure 3 shows an example of a complex compute engine implemented across multiple FPGAs in the HPC platform. There are several virtual processors constructed from different compiler tools; these are connected together through the DIMEtalk packet-switched communications network.

Conclusion

FPGA computing is deliverable on familiar cluster-based computing platforms; plus, with an ANSI-standard C compiler, you can achieve substantial performance gains. Looking forward, FPGA technology will likely advance HPC performance at a far greater rate than traditional processor-based technologies. For traditional software engineers, compiler tools will also become far more advanced and easy to use.

Nallatech is providing a special bundle of its entry-level HPC card with complete enabling tools software and a DIME-C to FPGA runtime tool flow. Please visit www.nallatech.com/hpc for more details or to place an order.

For additional information, contact Nallatech at [email protected] or [email protected].

Figure 3 – Example DIMEtalk network implementing many different virtual processors: a host processor (Linux/Windows/VxWorks on PPC/x86/Itanium) connects over GbE and LVDS I/O to a network of FPGAs hosting virtual processors built from DIME-C, VHDL, a PowerPC, and a soft processor, plus DSP library functions, local block memory, DDR2 SRAM and SDRAM, and a RAID array; the virtual processors are linked by low-overhead, variable-bandwidth network/packet-based or direct-coupled gigabit serial or parallel buses.


Accelerate Image Applications with Poseidon ESL Tools

Triton system-level design tools simplify the creation of high-speed solutions for imaging applications.

by Stephen Simon, Sr. Director of Marketing, Poseidon Design Systems, [email protected]

Bill Salefski, VP of Technical Marketing, Poseidon Design Systems, [email protected]

The number of embedded processor architectures implemented in today's image-processing systems is growing. Designers value the ability of processors to perform complex decision and combinational processes more efficiently than hardware logic. As a result, these designers are moving to a processor-based design that combines the advantages of processors with the inherently fast processing rates of hardware accelerators.

Image-processing applications are rich with signal-processing requirements. FFTs, FIR filtering, discrete cosine transforms (DCT), and edge detection are all common functions found in image-processing algorithms. These functions are computationally intensive and can severely impact the performance of processor-based solutions. You can design these functions in hardware to elevate performance, but this requires a large amount of manpower and expertise, which may not be available in today's shrinking product design cycles.

Designers seem to be caught between the poor performance of processor-based solutions and the long design times of discrete hardware implementations. What they need is a way to bridge the gap and deliver a system solution that meets performance requirements without having to design discrete solutions with RTL.

Triton System-Level Tools

Poseidon Design Systems offers a way out of this dilemma. The Triton Tool Suite greatly simplifies the process of creating high-performance systems without the time-consuming task of designing hardware with RTL. The suite comprises two tools, Tuner and Builder. Tuner is a SystemC simulation, analysis, and verification environment; Builder is a hardware accelerator generation tool. Triton tools support PowerPC™ and Xilinx® MicroBlaze™ processor-based platforms.

Triton Tuner

Performing proper system analysis and design is critical when assembling any high-performance imaging system. For example, you must evaluate many architectural issues that can critically impact the performance of the entire system. Because these systems require complicated software and hardware architectures, you must also evaluate the performance of proposed architectures early in the design process rather than waiting for the working system to meet your desired performance goals.

A number of factors determine system performance beyond just the available processor MIPS. Some of these factors are proper cache management, shared-bus congestion, and system topology. With Poseidon's Tuner product (Figure 1), you can co-simulate your hardware and software systems in the SystemC environment using transaction-level models (TLMs). You can then evaluate system operation and performance and make critical design decisions to develop a robust, efficient solution.

Tuner uses transaction-level modeling to provide the proper level of abstraction and fast simulation speeds. This environment also provides data visualization capability and event coherency, giving you a powerful tool in analyzing and verifying the performance of your target architecture. Tuner outputs data not just on processor performance, but also on cache performance, traffic congestion on shared buses, and efficiency of the memory hierarchy. Tuner links hardware events such as cache misses to the line of code and data accessed at the time; you can gain insight into the software profile and how effectively the system architecture supports the software.

Triton Builder

Hardware/software partitioning also impacts architectural decisions. Determining what portions of the algorithms need to be in hardware to achieve your desired results is not a simple task. In a traditional design flow, a long process follows the partitioning decision to build the hardware and test the resultant system. Changes at this point in the partition are very costly and will greatly impact the delivery schedule.

With Poseidon's Builder tool (Figure 2), you can quickly create a hardware/software partition and then evaluate the performance of the proposed architecture. Builder allows you to easily move critical portions of the selected algorithm to a custom hardware accelerator. Builder inputs algorithms written in ANSI C and creates a complete system solution. No special programming methodology or custom language is required. The tool generates the RTL of the accelerator, memory compiler scripts, test bench, modified C algorithm, software drivers, and TLM of the resultant accelerator hardware. Builder generates all of the parts required to implement the accelerated solution.

To complete the design loop, you can import the accelerator TLM that Builder created into Tuner and quickly evaluate the performance of the selected partition and resultant hardware. This provides you with almost instantaneous feedback about the effect of your partitioning decisions. The Tuner simulation environment displays system details that evaluation board platforms cannot. This tight loop enables you to quickly test different approaches and configurations, making key trade-offs in performance, power, and cost.

Builder supports different architectures to support different application data requirements. For large data requirements, Builder instantiates a bus-based solution (Figure 3), such as one on the PowerPC PLB, based on direct memory access (DMA) engines to rapidly move data into and out of the hardware. For smaller data sets or lower latency applications, Builder provides a tightly coupled solution that interfaces to the processor through the auxiliary processor unit (PowerPC) or Fast Simplex Link (MicroBlaze processor) interfaces. For very large data requirements, Builder supports the Xilinx multi-port memory controller (MPMC). This provides for large external fast memories to hold the data and overcome bandwidth limitations on existing processor buses.

With all of these options, Builder automates the process of programming the interfaces, communicating with the accelerator, and scheduling the data. With these two tools you can make critical design trade-offs and quickly realize a high-performance imaging system that meets your design requirements.

Radix 4 FFT Example

Let's look at a typical image-processing application using a fast Fourier transform. We chose an FFT for its popularity and because most designers are familiar with it. With Triton tools, you can not only quickly implement common functions such as FFTs; you can also implement the proprietary algorithms companies develop to differentiate their designs.



Figure 1 – The Tuner main screen displays software profiling data.

Figure 2 – The Builder main screen displays accelerator parameters.

Figure 3 – Bus-based accelerator: a compute core with local memory and a DMA controller behind a bus interface to the PLB or FSL bus


The specific FFT algorithm in our example is a Cooley-Tukey Radix 4 decimation in frequency written in standard ANSI C. The algorithm runs on a 300 MHz PowerPC on a Virtex™-II platform. The algorithm executes in 1.3 ms. This gives a maximum frame rate of 769 frames per second when 100% of the host processor is dedicated to the FFT task. Processing the selected algorithm in place makes efficient use of the memory resources. The software comprises five main processing for loops, one for each pass of the FFT algorithm.

The first task is to profile the FFT algorithm to determine the best candidates for moving to FPGA fabric. Tuner reads the architectural description from Xilinx Platform Studio (XPS) and simulates the algorithm. The five loops performing the Radix 4 butterflies each consume between 15% and 23% of the cycles used in the entire algorithm. These are all good candidates for acceleration and C-to-RTL conversion into dedicated FPGA logic.

Using Builder, you can determine how much of the algorithm to move to hardware to accelerate the application. If the design needs only a 50% improvement in processing speed, it would be very inefficient to implement an entire FFT in hardware. In this example, by selecting all five loops for partitioning, more than 99% of the processor cycles were moved into the hardware accelerator. It would be just as easy, if you required less performance, to select any number of loops to match the desired acceleration of the system, thus saving space in the fabric for other uses.

Along with the ability to partition loops into hardware, Builder gives you the flexibility to easily move functions into hardware as well. With this flexibility, you have the freedom to match the partition to the requirements and structure of the specific application.

Builder also provides a C-lint tool that can help you determine which existing code is synthesizable. If any non-synthesizable structures exist within the algorithm, the tool flags that section of code and reports the condition that makes it unrealizable in hardware. The tool also suggests enhancements and modifications that might be useful in implementing more efficient algorithms in hardware.

The partitioning process is as simple as selecting the loops or functions that you would like moved into the accelerator – the tool performs the rest. If you wish, you can control the optimizations and settings to optimize the solution, making trade-offs to best match the desired system.

Builder also takes care of integrating the partitioned code back into software by generating a software driver. The tool determines the scheduling of the variables for the resultant hardware and ensures that the data will be available when the hardware requires it. Builder determines memory sizes and memory banking requirements to efficiently process the data in the central compute core. Builder then creates the memory compilation script for the LogiCORE™ system to build the optimized memory structure.

The bus-based accelerator is appropriate in this example because of the moderate amount of data required. This solution has its own DMA, which enables the accelerator to access the data in system memory and run autonomously to the processor. Builder automatically schedules and programs the accelerator to transfer data into it using the DMA, and similarly transfer the results back out into system memory.

Because of the tight integration with the Tuner environment, you do not have to guess the effect of the partitioning decisions and architecture options created using Builder. With each new option you can quickly generate the TLM and simulate the proposed solution using Tuner. This gives you instant feedback as to how fast the resultant solution is as well as how it integrates into the existing system architecture. When the software and hardware meet the desired performance, you can automatically generate the pcores for the hardware accelerator to import back into XPS. All of the required files and scripts are generated, allowing you to continue on with the existing design flow.

By using this automated technology, you can easily create and verify speedups of as much as 16x. This pushes the maximum frame rate of the FFT application to 12,000 frames per second for each channel of FFT implemented, with the processor load decreasing to less than 5%.

You can see from our FFT example the ease of use of Triton ESL tools and how, with little effort, you can generate high-performance systems. This example took only two man-weeks of effort, without any manual modifications to the resultant system. If you require additional processing capability, it is a straightforward process to modify the software to support a streaming architecture. By creating concurrency between passes or creating multiple channels with separate accelerators, you can quickly achieve an overall acceleration of 48x to 80x.

Conclusion

The Triton Tool Suite is a powerful tool for the development of image-processing architectures. The Triton Tool Suite enables you to efficiently perform and verify architectural development, hardware/software partitioning, and the automated hardware generation of key digital signal processing algorithms. Tuner and Builder not only provide large savings in your design efforts; they allow you to create more efficient high-performance and scalable solutions. The tools are highly automated, yet give you the flexibility to optimize your solution and meet your system design requirements.

The system also automates the interface between the Triton Tool Suite and the Xilinx XPS system. This solution is optimized for computationally intensive algorithms found in audio, video, VoIP, imaging, wireless, and security applications. For more information, please visit our website at www.poseidon-systems.com and request a free 30-day evaluation.




by James Hrica, Systems Architect, Celoxica, [email protected]

Jeff Jussel, GM Americas, Celoxica, [email protected]

Chris Sullivan, Strategic Marketing Director, Celoxica, [email protected]

The microprocessor has had more impact on electronics and our society than any other invention since the integrated circuit. The CPU's programmability using simple sets of instructions makes the capabilities of silicon available to a broad range of applications. Thanks to the scalability of silicon technologies, microprocessors have continuously grown in performance for the last 30 years.

But what if your processor does not have enough performance to power your application? Adding more CPUs may not fit power or cost budgets – and more than likely won't provide the needed acceleration anyway. Adding custom hardware accelerators as co-processors adds performance at lower power. But unlike processors, custom hardware is not easily programmable. Hardware design requires special expertise, months of design time, and costly NRE development charges.


Accelerating System Performance Using ESL Design Tools and FPGAs

ESL design tools provide the keys for opening more applications to algorithm acceleration.


Enter the combination of high-performance Xilinx® FPGAs and electronic system-level (ESL) design tools, expressly tuned for FPGA design. By taking advantage of the parallelism of hardware, certain algorithms can gain faster performance – at lower clock speeds using less power – than what is possible in serial processor operations. Xilinx FPGAs provide a reconfigurable option for hardware acceleration, while the addition of ESL design flows from Celoxica gives developers access to these devices using a software design flow. Software-based design flows are now able to directly program Xilinx FPGA devices as custom co-processors (see Figure 1).

Will FPGA Acceleration Work for All Algorithms?

When a general-purpose CPU or DSP alone can't deliver the performance, several options are available (Figure 2). However, not all algorithms lend themselves well to all types of acceleration. The first step in the process is to determine whether the application will benefit from the parallelism of hardware.

Amdahl's law, developed by computer architect Gene Amdahl in 1967, asserts a pessimistic view of the acceleration possible from parallelism. Specifically applied to systems using multiple CPUs, Amdahl's law states that the acceleration possible is limited to the proportion of the accelerating algorithm's total processing time. If the algorithm takes up p% of the total processing time and is spread across N processors, then the potential speedup can be written as 1/[(1-p) + p/N]. For example, assuming that 20% of an algorithm is spread across four processors, the total processing time is at best reduced to 85% of the original. Given the additional cost and power required by three extra processors, a mere 1.2x speedup may not provide a big enough return.

Using custom hardware accelerators, the N in Amdahl's equation becomes the speedup multiplier for accelerated algorithms. In hardware, this multiplier can be orders of magnitude higher; the power utilization and costs of FPGA accelerators are generally much lower. Still, Amdahl's law would seem to limit the potential theoretical total acceleration. Fortunately, in practice Amdahl's law has proven too pessimistic; some suggest that the proportional throughput of parallel hardware increases with additional processing power, making the serial portion smaller in comparison. For certain systems, total accelerations of one, two, or even three orders of magnitude are possible.
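To make the arithmetic concrete, this short snippet simply evaluates Amdahl's formula for the case discussed in the text (20% of the run time spread across four CPUs) and for two assumed hardware multipliers; it is a plain calculation of the published formula, not part of any vendor tool.

// Amdahl's law: speedup = 1 / ((1 - p) + p / N),
// where p is the fraction of run time that is accelerated by a factor N.
#include <cstdio>

static double amdahl_speedup(double p, double n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main() {
    // 20% of the algorithm spread across 4 processors: about 1.18x overall.
    printf("4 CPUs,  p=0.20: %.2fx\n", amdahl_speedup(0.20, 4.0));

    // The same 20% offloaded to hardware assumed to run 50x faster: about 1.24x.
    // The accelerated fraction p matters more than the multiplier N.
    printf("50x HW,  p=0.20: %.2fx\n", amdahl_speedup(0.20, 50.0));

    // Accelerating 95% of the work by 50x gives a much larger payoff: about 14.5x.
    printf("50x HW,  p=0.95: %.2fx\n", amdahl_speedup(0.95, 50.0));
    return 0;
}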

Figure 1 – Software design flows directly program FPGAs. Algorithm design in C code proceeds through C debug, optimization, and compilation to hardware/software partitioning; from there, the software processor programming path compiles software for the processor (.obj), Celoxica's software-compiled silicon path generates EDIF for the FPGA, and the conventional hardware design path requires hand translation to RTL and an RTL design flow.

Figure 2 – System acceleration possibilities. A microprocessor alone offers serial execution and software programmability at higher power and cost; application-specific hardware offers parallel performance with low power and cost but a long, expensive design cycle; a hardware/software accelerated system optimizes performance, power, and cost while keeping software programming flexibility. Compared as compute fabrics, a processor/DSP is sequential, very flexible, power-inefficient, and inexpensive; a Xilinx FPGA (configurable uP/FPGA/programmable SoC/DSP) is parallel, flexible, power-efficient, relatively inexpensive, and supports C-based design for software and hardware; full custom hardware is parallel and very power-efficient but inflexible, very expensive (very high volumes only), and very difficult to design.




What Portions of the Algorithm Should Go in the FPGA?

To accelerate a software application using FPGAs, you must partition the design between those tasks that will remain on the CPU and those that will run in hardware. The first step in this process is to profile the performance of the algorithm running entirely on the CPU.

You can gain some insight into where the application spends most of its run time using standard software profiling tools. The profiler report points to the functions, or even lines of code, that are performance bottlenecks. Free profiling tools such as gprof for the GNU development environment or Compuware's DevPartner Profiler for Visual Studio are readily available.

You should analyze the application code profile around areas of congestion to determine opportunities for acceleration. You should also look for sub-processes that can benefit from the parallelism of hardware. To accelerate the system, you move portions of your design into hardware connected to the CPU, such as an FPGA situated on a PCI, HyperTransport, or PCI Express interface board. PowerPC™ or MicroBlaze™ embedded processors provide another possible approach, putting the reconfigurable fabric and processor within the same device.

It is important during the analysis process not to overlook the issues of data transfer and memory bandwidth. Without careful consideration, data transfer overhead can eat into gains made in processing time. Often you may need to expand the scope of the custom hardware implementation to minimize the amount of data to be transferred to or from the FPGA. Processes that expand data going into the accelerated block (such as padding or windowing) or processes that contract data coming out of that block (such as down-sampling) should all be included in the FPGA design to reduce communication overhead.

Where Is Parallelism in the Algorithm?

Hardware accelerates systems by taking large serial software functions and performing them in parallel as many smaller processes. This results in a highly optimized, very specific set of hardware functions. ESL tools are an important part of this process, allowing you to insert parallelism from software descriptions, using FPGAs like custom co-processors.

You can exploit parallelism at many levels to move functionality from software into custom hardware. Fine-grained parallelism is achieved by executing many independent, simple statements simultaneously (variable assignments, incrementing counters, basic arithmetic, and others). The pipelining of sequentially dependent statements is another form of fine-grained parallelism, accomplished by arranging individual stages in parallel such that each stage feeds the input of the next stage downstream.

For example, by pipelining the inside of a code loop, including the loop control logic, it may be possible to reduce iterations to only one clock cycle. After priming the pipeline with data, the loop could be made to execute in a number of clock cycles equal to the number of loop iterations (plus the relatively modest latency). This type of optimization often provides the most benefits when accelerating an algorithm in hardware.

Coarse-grained parallelism can also provide acceleration. Larger independent sub-processes like functions or large looped blocks, which run sequentially on the processor, can run simultaneously in custom hardware. Similarly, unrolling loops either fully or partially can add parallelism. In unrolling, the processing logic inside loops is replicated multiple times so that each instance works in parallel on different data to reduce the number of loop iterations. Pipelining large sequentially dependent processes can further remove cycles at the cost of some latency.
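As a simple illustration of unrolling (our own generic example, not taken from a customer design), the loop below is rewritten so that four independent element-wise operations appear in each iteration; a C-to-gates compiler can map the four statements in the unrolled body onto four parallel datapaths.

// Loop unrolling by a factor of four: the unrolled body exposes four
// independent operations that hardware can execute in parallel, cutting
// the iteration count from N to N/4.
#include <cstdio>

const int N = 1024;                         // assumed divisible by 4 for clarity

void scale_rolled(const int* in, int* out) {
    for (int i = 0; i < N; i++)
        out[i] = 3 * in[i] + 1;
}

void scale_unrolled(const int* in, int* out) {
    for (int i = 0; i < N; i += 4) {
        out[i]     = 3 * in[i]     + 1;     // these four statements have no
        out[i + 1] = 3 * in[i + 1] + 1;     // dependencies on one another, so
        out[i + 2] = 3 * in[i + 2] + 1;     // they can become four parallel
        out[i + 3] = 3 * in[i + 3] + 1;     // hardware datapaths
    }
}

int main() {
    static int in[N], a[N], b[N];
    for (int i = 0; i < N; i++) in[i] = i;
    scale_rolled(in, a);
    scale_unrolled(in, b);
    printf("results match: %s\n", (a[N - 1] == b[N - 1]) ? "yes" : "no");
    return 0;
}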

Another important form of coarse-grained parallelism occurs when the software and hardware operate in tandem. For example, a system accelerated by transferring frames of data to custom hardware for co-processing might be structured so that while the hardware processes one frame, the software is accomplishing any requisite post-processing of the previous frame, or pre-processing the next frame. This type of coordination and efficiency is made possible by the common development environments for both the hardware and software provided by ESL design.

Parallelism can also optimize the memory bandwidth in a custom hardware accelerator. For example, the addition operator for two vectors may be written as a loop, which reads an element of each vector from memory, adds the operands, and writes the resulting sum vector to memory. Iterations of this operation require three accesses to memory, with each access taking at least one cycle to accomplish.

Fortunately, modern Xilinx FPGAs incorporate embedded memory features that can help. By storing each of the three vectors in a separate embedded memory structure, you can fully pipeline the action of the summing loop so that iterations take only one cycle. This simple optimization is one example of how you can maximize system acceleration in hardware.
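The vector-sum loop in question is just the few lines below (a generic version written for illustration); the point of the optimization is that when a, b, and c each live in their own embedded block RAM, the three memory accesses in the loop body no longer contend for a single memory port, and the loop can be pipelined to one iteration per clock.

// Vector addition: each iteration performs two reads (a[i], b[i]) and one
// write (c[i]). With all three arrays in one memory these accesses serialize;
// with each array in its own embedded RAM they can happen in the same cycle.
#include <cstdio>

const int LEN = 256;

void vector_add(const int a[LEN], const int b[LEN], int c[LEN]) {
    for (int i = 0; i < LEN; i++)
        c[i] = a[i] + b[i];
}

int main() {
    static int a[LEN], b[LEN], c[LEN];
    for (int i = 0; i < LEN; i++) { a[i] = i; b[i] = 2 * i; }
    vector_add(a, b, c);
    printf("c[100] = %d\n", c[100]);   // 300
    return 0;
}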

How Do Accelerators Handle Floating Point?

The use of floating-point mathematics is often the most important issue to resolve when creating custom hardware to accelerate a software application. Many software applications make liberal use of the high-performance floating-point capabilities of modern CPUs, whether the core algorithms require it or not. Although floating-point arithmetic operations can be implemented in PLD hardware, they tend to require a lot of resources. Generally, when facing floating-point acceleration, it is best to either leave those operations in the CPU portion of the design or change those operations to fixed point. Fortunately, you can implement many algorithms effectively using fixed-point mathematics, and there are pre-built floating-point modules for those algorithms that must implement floating point in hardware.

A detailed description of the process for converting floating-point to fixed-point operations can be highly algorithm-dependent, but in summary, the process begins by analyzing the dynamic range of the data going into the algorithm and determining the minimum bit width possible to express that range in an integer form. Given the width of the input data, you can trace through the operations involved to determine the bit growth of the data.


For example, to sum the squares of two 8-bit numbers, the minimum width required to express the result without loss of information is 17 bits (the square of each input requires 16 bits and the sum contributes 1 more bit). By knowing the desired precision of the output, simply work backwards through the operation of the algorithm to deduce the internal bit widths.
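The short check below works through that bit-growth argument in code for the worst-case 8-bit inputs; it is only a numerical sanity check of the rule stated above, not part of any vendor flow.

// Bit growth for the sum of squares of two 8-bit values:
// each square needs 16 bits (255*255 = 65025 < 2^16), and adding two
// 16-bit results needs one more bit, so 17 bits cover the worst case.
#include <cassert>
#include <cstdint>
#include <cstdio>

int main() {
    uint8_t a = 255, b = 255;                       // worst-case inputs
    uint32_t sum_sq = uint32_t(a) * a + uint32_t(b) * b;

    printf("max sum of squares = %u\n", sum_sq);    // 130050
    assert(sum_sq <  (1u << 17));                   // fits in 17 bits
    assert(sum_sq >= (1u << 16));                   // ...but not in 16
    return 0;
}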

You can implement well-designed fixed-point algorithms in FPGAs quite efficiently because you can tailor the width of internal data paths. Once you know the width of the inputs, internal data paths, and outputs, the conversion of data to fixed point is straightforward, leading to efficient implementation on both the hardware and software sides.

Occasionally, an application requires more complex mathematical functions – sin(), cos(), sqrt(), and others – on the hardware side of the partition. For example, these functions may operate on a discrete set of operands or may be invoked inside of loops, with an argument dependent on the loop index. In these cases, the function can usually be implemented in a modestly sized lookup table that can often be placed in embedded memories. Functions that take arbitrary inputs can also be used in hardware lookup tables with interpolation. If you need more precision, you can try iterative or convergent techniques at the cost of more cycles. In these cases, the software compilation tools should make good use of existing FPGA resources.
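A lookup table with linear interpolation is easy to prototype in ordinary C++ before committing it to embedded memory; the sketch below (table size, float precision, and function names all chosen arbitrarily by us) shows the technique for sin(), whereas a hardware version would normally hold fixed-point entries in block RAM.

// sin() via a 256-entry lookup table with linear interpolation between
// neighboring entries. Table size and the use of float are illustrative;
// in an FPGA the table would sit in embedded RAM with fixed-point entries.
#include <cmath>
#include <cstdio>

const double PI = 3.14159265358979323846;
const int TABLE_SIZE = 256;
static float sin_table[TABLE_SIZE + 1];        // one extra entry for interpolation

void init_table() {
    for (int i = 0; i <= TABLE_SIZE; i++)
        sin_table[i] = (float)std::sin(2.0 * PI * i / TABLE_SIZE);
}

// x is a phase in [0, 1) representing 0..2*pi
float sin_lut(float x) {
    float pos  = x * TABLE_SIZE;
    int   idx  = (int)pos;                     // table index
    float frac = pos - idx;                    // fractional part for interpolation
    return sin_table[idx] + frac * (sin_table[idx + 1] - sin_table[idx]);
}

int main() {
    init_table();
    float approx = sin_lut(0.125f);            // 0.125 of a turn = 45 degrees
    printf("lut=%f  libm=%f\n", approx, std::sin(PI / 4.0));
    return 0;
}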

What Skills Are Required to Use FPGA Acceleration?

Fortunately, hardware design from ESL tools has come a long way in the last few years. It is now possible to use C-based hardware languages to design the hardware portions of the system. You can easily generate FPGA hardware from the original software algorithms.

Design tools from Celoxica can compile or synthesize C descriptions directly into FPGA devices by generating an EDIF netlist or RTL description. These tools commonly provide software APIs to abstract away the details of the hardware/CPU connection from the application development. This allows you to truly treat the FPGA as a co-processor in the system and opens doors to the use of FPGAs in many new areas such as high-performance computing, financial analysis, and life sciences.

The languages used to compile software descriptions to the FPGA are specifically designed to both simplify the system design process with a CPU and add the necessary elements to generate quality hardware (see Figure 3). Languages such as Handel-C and SystemC provide a common code base for developing both the hardware and software portions of the system. Handel-C is based on ANSI-C, while SystemC is a class library of C++.

Any process that compiles custom hardware from software descriptions has to deal with the following issues in either the language or the tool: concurrency, data types, timing, communication, and resource usage. Software is written sequentially, but efficient hardware must translate that code into parallel constructs (using potentially multiple hardware clocks) and implement that code using all of the proper hardware resources.

Both Handel-C and SystemC add simple constructs in the language to allow expert users control over this process, while maintaining a high level of abstraction for representing algorithmic designs. Software developers who understand the concepts of parallelism inherent to hardware will find these languages very familiar and can begin designing accelerated systems using the corresponding tools in a matter of weeks.
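To give a flavor of how such concurrency and communication constructs look, the fragment below uses standard SystemC (not Handel-C, and not tied to any particular Celoxica tool) to run a producer and a consumer as concurrent processes connected by a FIFO channel; the module names and timing are arbitrary.

// Two concurrent SystemC processes communicating over a FIFO channel:
// the kind of explicit parallelism and communication a C-based hardware
// language makes available alongside ordinary sequential code.
#include <systemc.h>

SC_MODULE(Producer) {
    sc_fifo_out<int> out;
    void run() {
        for (int i = 0; i < 8; i++) {
            out.write(i * i);                  // blocking write into the channel
            wait(10, SC_NS);
        }
    }
    SC_CTOR(Producer) { SC_THREAD(run); }
};

SC_MODULE(Consumer) {
    sc_fifo_in<int> in;
    void run() {
        while (true) {
            int v = in.read();                 // blocks until data is available
            std::cout << sc_time_stamp() << ": got " << v << std::endl;
        }
    }
    SC_CTOR(Consumer) { SC_THREAD(run); }
};

int sc_main(int, char*[]) {
    sc_fifo<int> channel(4);                   // FIFO of depth 4
    Producer p("producer");
    Consumer c("consumer");
    p.out(channel);
    c.in(channel);
    sc_start(200, SC_NS);
    return 0;
}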

How Much Acceleration Can I Achieve?

Table 1 gives a few examples of the acceleration achieved in Celoxica customer projects. In another example, a CT (computed tomography) reconstruction-by-filtered-back-projection application takes in sets of 512 samples (a tomographic projection) and filters that data first through an FFT. It then applies the filter function to the frequency samples and applies an inverse FFT. These filtered samples are used to compute their contribution to each of the pixels in the 512 x 512 reconstructed image (this is the back-projection process). The process is repeated with the pixel values accumulated from each of the 360 projections. The fully reconstructed image is then displayed on a computer screen and stored to disk. Running on a Pentium 4 3.0 GHz computer, the reconstruction takes about 3.6 seconds. For the purposes of this project, the desired accelerated processing time target for reconstruction is 100 ms.

Profiling the application showed that 93% of the CPU run time is spent in the function that does the back projection, making that function the main target for acceleration. Analyzing the back-projection code showed the bottleneck clearly. For every pixel location in the final image, two filtered samples are multiplied by coefficients, summed, and accumulated in the pixel value. This function is invoked 360 times, and so the inner loop executes around 73 million times.

Figure 3 – Adding hardware capabilities. SystemC layers core libraries (SCV, TLM, master/slave), standard channels for various models of computation (Kahn process networks, static dataflow), primitive channels (signal, timer, mutex, semaphore, FIFO, and others), a core language (modules, ports, processes, events, interfaces, channels, and an event-driven simulation kernel), and data types (four-valued logic and vectors, bits and bit-vectors, arbitrary-width integers, fixed point, C++ user-defined types) on top of the ANSI/ISO C language standard. Handel-C layers core libraries (TLM (PAL/DSM), fixed/floating point), a core language (par(...), seq(...), interfaces, channels, bit manipulation, RAM and ROM, single-cycle assignment), and data types (bits and bit-vectors, arbitrary-width integers, signals) on top of the ANSI/ISO C language standard.




Each of these iterations requires 3 reads and 1 write to memory, 3 multiplies, and several additions.
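For orientation only, here is a generic back-projection inner loop of the shape just described; the data layout, argument names, and linear interpolation details are our assumptions rather than the customer's actual code, and it is written in the floating-point style of the original software, before the fixed-point conversion discussed next.

// Generic back-projection inner loop (illustrative, not the customer code):
// for every pixel, two neighboring filtered samples are weighted, summed,
// and accumulated into the image. Called once per projection (360 times).
#include <cstdio>

void backproject_projection(const float* filtered,   // 512 filtered samples
                            const int*   sample_idx, // per-pixel sample index (hypothetical)
                            const float* frac,       // per-pixel interpolation fraction
                            float        scale,      // per-projection weight
                            float*       image,      // 512 x 512 accumulator
                            int          n_pixels)
{
    for (int p = 0; p < n_pixels; p++) {
        int   i = sample_idx[p];
        float s = (1.0f - frac[p]) * filtered[i]      // weight two samples...
                +         frac[p]  * filtered[i + 1];
        image[p] += scale * s;                        // ...and accumulate into the pixel
    }
}

int main() {
    // Tiny synthetic call just to exercise the loop shape.
    float filtered[4] = {0.f, 1.f, 2.f, 3.f};
    int   idx[2]      = {0, 2};
    float frac[2]     = {0.25f, 0.5f};
    float image[2]    = {0.f, 0.f};
    backproject_projection(filtered, idx, frac, 1.0f, image, 2);
    printf("%f %f\n", image[0], image[1]);            // 0.25 and 2.5
    return 0;
}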

Fortunately, the back-projection algorithm lends itself well to parallel processing. Although the code was written using floating point, the integer nature of the input and output data made a fixed-point implementation perfectly reasonable. Working together, engineers from Celoxica and their customer accelerated the application, running on a standard PC, using a PCI add-in card with a Xilinx Virtex™-II FPGA and six banks of 512,000 x 32 SRAM. A sensible first-order partitioning of the application put the back-projection execution in the PLD, including the accumulation of the reconstructed image pixels, leaving the filtering and all other tasks running on the host processor.

The engineers designed a single instance of the fixed-point back-projection algorithm and validated it using the DK Design Suite. An existing API simplified the data transfer directly from the software application to the Xilinx FPGA over the PCI bus. They exploited fine-grained parallelism and rearranged some of the arithmetic to maximize the performance of the algorithm in hardware. The core unit was capable of processing a single projection in about 3 ms running at a 66 MHz clock rate.

The fact that each projection could be computed independently meant that multiple back-projection cores could be running concurrently. The engineers determined that 18 cores would be enough parallelism to reach the performance goal comfortably. The software would pass 18 sets of 512 samples to the FPGA, which would store them in 18 individual dual-port RAMs. Both ports of these RAMs were interfaced to each back-projection core, allowing them to read the two samples required in a single cycle.

The cores were synchronized and the 18 contributions were fed into a pipelined adder tree. The final summed value was accumulated with the previous pixel value stored in one of the external SRAM banks. The accumulated value is written into the other SRAM bank, and the two are read in ping-pong fashion on subsequent passes. With this architecture, sample sets are transferred from software to hardware and the reconstructed image is passed back once when complete. The software was modified so that while one set of data was being processed in the FPGA, the next set was being filtered by the host processor, further minimizing overall processing time.
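The ping-pong use of the two SRAM banks can be sketched in software as a pair of buffers whose roles swap every pass: one bank supplies the previously accumulated pixel values while the other receives the updated ones. This is a generic illustration of the double-buffering idea, not the actual memory controller logic.

// Ping-pong accumulation across two buffers standing in for the two external
// SRAM banks: on each pass, read the running total from one bank, add the new
// contribution, write the result to the other bank, then swap roles.
#include <cstdio>

const int PIXELS = 8;                       // tiny image for illustration

int main() {
    static float bankA[PIXELS] = {0}, bankB[PIXELS] = {0};
    float* readBank  = bankA;
    float* writeBank = bankB;

    for (int pass = 0; pass < 4; pass++) {  // e.g. one pass per group of projections
        for (int p = 0; p < PIXELS; p++) {
            float contribution = 1.0f;      // stand-in for the adder-tree output
            writeBank[p] = readBank[p] + contribution;
        }
        // Swap the banks: this pass's output becomes the next pass's input.
        float* tmp = readBank;
        readBank   = writeBank;
        writeBank  = tmp;
    }
    printf("pixel 0 after 4 passes = %f\n", readBank[0]);   // 4.0
    return 0;
}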

In the end, leveraging the features of FPGAs and modern ESL tools and taking advantage of the flexibility provided by a software-based design flow, the example design was able to realize a 45x acceleration for a real-world application.

Conclusion

Accelerating complex software algorithms in programmable logic and integrating them into a system that contains processors or DSPs has never been easier or more affordable. Advanced FPGA architectures, high-speed interconnect such as PCI Express and HyperTransport, and the maturity of C-based ESL design tools reduce the design burden on developers and open up new possibilities for the application of FPGA technology. Balancing the needs of power, performance, design time, and cost is driving the FPGA into new market applications.

For more information about ESL design solutions for Xilinx FPGAs, low-cost starter kits, and FPGA-based acceleration benchmarks, visit www.celoxica.com/xilinx.

Application | Hardware co-processor | Software only | Speedup
Hough and inverse Hough processing | 2 sec processing time @ 20 MHz | 12 minutes processing time (Pentium 4, 3 GHz) | 370x faster
AES encryption, 1 MB data | 424 ms / 19.7 MBps | 5,558 ms / 1.51 MBps | 13x faster
AES decryption, 1 MB data | 424 ms / 19.7 MBps | 5,562 ms / 1.51 MBps | 13x faster
Smith-Waterman, ssearch34 from FASTA | 100 sec FPGA processing | 6,461 sec processing time (Opteron) | 64x faster
Multi-dimensional hypercube search | 1.06 sec FPGA @ 140 MHz (Virtex-II device) | 119.5 sec (Opteron, 2.2 GHz) | 113x faster
Callable Monte-Carlo analysis, 64,000 paths | 10 sec processing @ 200 MHz FPGA system | 100 sec processing time (Opteron, 2.4 GHz) | 10x faster
BJM financial analysis, 5 million paths | 242 sec processing @ 61 MHz FPGA system | 6,300 sec processing time (Pentium 4, 1.5 GHz) | 26x faster
Mersenne Twister random number generation | 319M 32-bit integers/sec | 101M 32-bit integers/sec (Opteron, 2.2 GHz) | 3x faster (bus bandwidth limited; processing ratio ~10-20x)

Table 1 – Augmenting processor performance


by Steve Kruse, Corporate Application Engineer, CebaTech, Inc., [email protected]

Sherry Hess, Vice President, Business Development, CebaTech, Inc., [email protected]

Rumors of the possibility of translating the results of a high-level language such as C directly into RTL have abounded for the last few years. CebaTech, a C-based technology company, has turned rumors into reality with the introduction of an untimed C-to-RTL compiler, the central element in an advanced system-level design flow.

Operating in stealth mode for several years, the company developed its C-to-RTL compiler for use with IP development in TCP/IP networking. The nature of TCP/IP networking code, combined with CebaTech's desire to have a complete system-level design tool, dictated that the compiler handle both very large and very complex code bases. The CebaTech flow is a software-centric approach to FPGA and ASIC design and implementation, allowing hardware designers to describe system architectures and software engineers to implement and verify these architectures in a pure software environment with traditional software tools.


Transforming Software to Silicon

C-based solutions are the future of EDA.


The challenge facing a team of system architects is to make the best design trade-off decisions so that the marketing requirements document (MRD) has been converted to the detailed specification for a balanced product. Only a balanced product is likely to find market acceptance, where careful consideration has been given to both innovative features and power-efficient usage of silicon. Until recently, a lack of suitable tools has hampered this all-important early decision-making.

The CebaTech flow allows architectural design at a high level of abstraction, namely untimed C. From this perspective, you can make design trade-offs in speed, area, and power of the end product up-front in C, where design issues are most easily addressed. The resulting "optimized" C-based design has a much higher likelihood of being competitive in the marketplace, thanks to the energy expended in honing it to an edge above and beyond the MRD.

A product with the correct cost/performance balance is what EDA strives to achieve. The eventual market price of a product is dictated by factors in the features and efficiency dimensions.

A Software-Oriented ESL Approach

Aside from striking the right balance between features and silicon efficiency, engineering teams must also grapple with verification of the chip design. Verification is by far the biggest component of product development costs because verification productivity has not grown at the same rate as chip capacity and performance. This issue has been acute for many years: the feasibility of integrating new features has not been matched by the designer's/architect's ability to guarantee correct implementation.

As represented by Figure 1, CebaTech aims to harness the convergence of three technological trends. First is the growing momentum of the open-source movement, wherein system-level specifications result from the collective intelligence of the global development community. Secondly, while these "golden" standards are freely available, successful implementation in silicon depends on domain-specific design expertise. Finally, the cost of product development must be kept in check through increasing use of the latest system-level tools. Although the toolflow pioneered by CebaTech has already proven its merit in IP networking, it is applicable to other emerging market sectors.

CebaTech's approach is a logical step in the evolution of the EDA industry, especially given the current bottleneck in verification. For the most part, complex system-on-a-chip (SoC) products are still developed in the traditional bottom-up way. Developers learned to use hardware description languages and write test benches, and eventually came to rely on FPGA-based prototyping systems and development environments for embedded cores. In recent years, traditional simulation methodology has yielded ground to virtual system prototyping (VSP) platforms, assertion-based verification, and equivalence-checking methods. Nonetheless, the cost of design continues to grow.

CebaTech bypasses a bottom-up approach by using a direct top-down compilation from the original C source code base. The pervasiveness of C, especially for systems software, was a major consideration. In benchmarks, C's efficiency and intuitiveness for capturing hardware designs have yielded an overall improvement of at least 10x in manpower requirements, as well as at least a 5x improvement in real-time verification.

The real benefit of compiling from the original C source code becomes apparent during the verification phase. CebaTech's belief is that the entire SoC ought to be coded in C and run as software in a native C environment, where running tool-generated cycle-accurate C will precisely represent the behavior of the generated RTL running in an HDL simulator. The architecture of the compiled RTL is congruent to the C source, and you can achieve extensive functional verification in the C environment.

As shown in Figure 2, the CebaTech compiler generates a cycle-accurate C model in addition to synthesizable RTL.


Third Quarter 2006 Xcell Journal 47

Open-SourceMovement

NetworkingInfrastructure

(Protocols)

ESL Tools

Software Design Environment

C Module Cycle-Accurate C Model

C2R Compiler

RTL Simulation Environment

Memory and FPU Resources RTL for Synthesis

FPGA Design Environment

Figure 1 – The evolution and eventual convergence into CebaTech’s technology andmethodology of C-based products that take

“software to silicon” efficiently.

Figure 2 – The C2R compiler from CebaTech allows untimed ANSI C code bases to be rapidly implemented using an FPGA approach.


The functional correctness of the compiler-generated RTL can be shown using formal tools that prove equivalency with the cycle-accurate C, which in turn has been validated in the pure C software environment. The compiler is able to map arrays and other C data structures onto on-chip or external memory resources. It has the structure of a typical software compiler, complete with linker. Thus, it can quickly compile designs ranging from small test cases to truly enormous software systems, making it feasible for a small team of engineers to implement full hardware offload for a system as complex as the OpenBSD TCP/IP stack. There are enormous time-to-market and die-size advantages to this methodology.

A Real-World FPGA Example

The CebaTech flow is unbounded by design size. It can address designs from small (several hundred lines of C code) to immense (45,000 lines or more, as is the case with TCP/IP). To give you an overview of an implementation with the CebaTech compiler, let's look at an FPGA conversion of a C-based compression algorithm. Although quite modest in the amount of code, this example demonstrates the power and uniqueness of the CebaTech flow for architecture exploration and design trade-offs with respect to area and performance. It highlights the narrowing of the verification gap and provides an excellent vehicle with which to demonstrate CebaTech's software-to-silicon methodology.

To illustrate the extent to which the CebaTech flow utilizes pure C source language constructs to describe system architecture, we developed and compiled six incremental versions of the compression code. All versions function identically when compiled and run natively on a computer system; however, each successive version improves the performance and synthesis results of the generated RTL, using a small number of source code changes for each step. (Further refinement beyond the architectures explored in this article is also possible.)

We can summarize the six versions of the compression source code as:

• Original restructuring of the source code. The bulk of the program is put into a "process" that can be directly compiled into an RTL module. This initial restructuring does not try for any optimization, and is indicated as Revision 1 in Table 1.

• A first cut at optimization where certain critical functions are placed into separate modules. This prevents them from in-line expansion and yields a baseline architecture, shown as Revision 2 in Table 1. Regrouping the functions into two parallel processes is shown as Revision 3. The area-optimized implementation yields a tripling of clock frequency, and the number of clock cycles decreases from 56 to 33. Revision 4 involves functions that are common to different callers.

• Flagging these functions as shared results in a quick area reduction. Various other common optimizations – notably loop unrolling – occur during Revision 5. A process-splitting operation allowed three simultaneous accesses to what was previously a single ROM. This step brought the clock cycles from 33 down to 12.

• The final iteration in this example, shown as Revision 6, is an example of a vendor-specific optimization. The Xilinx® Virtex™-II family makes both distributed and block on-chip RAM available. Using the latter implementation allows storage utilization to drop by almost 50%.

For all of these C-based design instances, we synthesized the resulting Verilog RTL netlist using Xilinx ISE™ software v8.1 against a Virtex-4 XC4VLX25-12 FPGA.

For this C-based design flow, we expended less than one week to understand the compression code base, create the test benches used for RTL simulation, perform the six design explorations, and synthesize and verify the results. You can see that the CebaTech approach enables much speedier product development and delivery with the best price/performance balance.

Conclusion

CebaTech has been using a compiler-driven ESL flow to achieve rapid prototyping of leading-edge networking products for several years. The compiler was forged in the crucible of high-performance SoC development, but is now available to other forward-looking development teams who need to quickly develop and deliver novel SoCs that strike the right balance between price and performance.

To learn more about CebaTech's ESL flow, visit www.cebatech.com and register to download our compiler white paper. Alternatively, you can call (732) 440-1280 to learn how to apply our flow to your current SoC design and verification challenges.


Architecture Revision Details | Slices (%) | Slice Flops (%) | LUTs (%) | SelectRAM (%) | # ROM | Clocks | Fmax Area (MHz) | Fmax Speed (MHz)
1 – Baseline Architecture | – | – | – | – | 1 | 146 | – | –
2 – Eliminate Redundant Functions | 89 | 24 | 68 | 18 | 1 | 56 | 12 | 58
3 – Define Parallel Processes | 44 | 14 | 33 | 18 | 1 | 33 | 40 | 69
4 – Optimize Any Shared Functions | 20 | 8 | 17 | 18 | 1 | 33 | 35 | 57
5 – ROM and FIFO Improvements | 20 | 9 | 18 | 18 | 3 | 12 | 36 | 52
6 – Utilize Block SelectRAM | 20 | 9 | 18 | 10 | 3 | 12 | 36 | 52

Table 1 – The results of exploring six versions of compression source code (Xilinx XC4V resource utilization in percent; Fmax in MHz for area- and speed-optimized synthesis)


by Stefan Möhl CTO Mitrionics [email protected]

Göran SandbergVice President of Product MarketingMitrionics [email protected]

Ever-increasing computing performance is the main development driver in the world of supercomputing. Researchers have long looked at using FPGAs as co-processors as a promising way to accelerate critical applications. FPGA supercomputing is now generating more interest than ever, with major system vendors offering general-purpose computers equipped with the largest Xilinx® Virtex™-4 FPGAs, thus making the technology accessible to a wider audience.

There are compelling reasons to give serious consideration to FPGA supercomputing, especially given the significant performance benefits you can achieve from using FPGAs to accelerate applications. Typical applications can be accelerated 10-100x compared to CPUs, making FPGAs very attractive from a price/performance perspective.

What is even more important is that FPGAs use only a fraction of the power per computation compared to CPUs. Increased computing power with traditional clusters requires more and more electrical power. Today, the cooling problems in large computing centers have become major obstacles to increased computing performance.




Plus, the development of FPGA devices is still keeping up with Moore’s law. In practice, this means that the performance of FPGA-based systems could continue to double every 18 months, while CPUs are struggling to deliver increased performance.

FPGAs as Supercomputing Co-Processors
FPGAs are highly flexible general-purpose electronic components. Thus, you must consider their particular properties when using them to accelerate applications in a supercomputer.

Compared to a CPU, FPGAs are Slow
It may seem like a contradiction to claim that FPGAs are slow when they can offer such great performance benefits over CPUs. But the clock speed of an FPGA is about 1/10th that of a CPU. This means that the FPGA has to perform many more operations in parallel per clock cycle if it is to outperform a CPU.

FPGAs are Not Programmable (From a Software Developer’s Perspective)
Software developers taking their first steps in FPGA supercomputing are always surprised to find out that the P for programmable in the acronym FPGA means “you can load a circuit design.” Without such a circuit design, an FPGA does nothing – it is just a set of electronic building blocks waiting to be connected to each other.

Creating a circuit design to solve a computational problem is not “programming” from a software developer’s point of view, but rather a task for an experienced electrical engineer.

FPGAs Require Hardware Design
With their general-purpose electronic components, FPGAs are usable in any electronic hardware design. That is why they work so well as computing co-processors in computers, provided that you have loaded the proper circuit design. Traditional development methods for FPGAs are focused on the hardware design aspects, which are very different from software development (see Table 1).

FPGAs in Supercomputers Enable a New Paradigm
Putting FPGAs in computers to accelerate supercomputing applications changes the playing field and rules dramatically. Traditional hardware design methods are no longer applicable for several reasons:

• Supercomputer programmers are typically software developers and researchers who would prefer to spend their time on research projects rather than acquiring the expertise to design electronic circuits.

• The sheer complexity of the designs required to compute real-world supercomputing problems is prohibitive and usually not viable.

• Hardware design methods are targeted at projects where a single circuit design is used in a large number of chips. With FPGA supercomputers, each chip will be configured with many different circuit designs. Thus, the number of different circuit designs will be greatly multiplied.

• Hardware design tools and methods focus on optimizing the size and cost of the final implementation device. This is significantly less important in a supercomputer because the investment in the FPGA has already occurred.

• Programs run in supercomputers are often research tasks that continuously change as the researcher gains deeper understanding. This requires the program to evolve over its life cycle. In contrast, circuit designs are usually created once and then used without changes for the life span of the target application.

The good news is that in a supercomputer, the FPGA is there to perform computing. This allows you to create software development methods for FPGAs that circumvent the complexity and methodology of general hardware design.

The Mitrion Virtual Processor
The key to running software in FPGAs is to put a processor in the FPGA. This allows you to program the processor instead of designing an electronic circuit to place in the FPGA. To obtain high performance in an FPGA, the circuitry of the processor design is adapted to make efficient use of the resources on the FPGA for the program that it will run. The result is a configuration file for the FPGA, which will turn it into a co-processor running your software algorithm. This approach allows you as a software developer to focus on writing the application instead of getting involved in circuit design. The circuit design process has already been taken care of with the Mitrion Virtual Processor from Mitrionics.

A Novel Processor Architecture
How do you get good performance from a processor on an FPGA when typical FPGA clock speeds are about 10 times slower than the fastest CPUs? Plus, the FPGA requires and consumes a certain amount of overhead for its reconfigurability. This means that it has a 10-100x less efficient use of available chip area compared to a CPU. Still, to perform computations 10-100x faster than a CPU, we need to put a massively parallel processor onto the FPGA. This processor should also take advantage of the FPGA’s reconfigurability to be fully adapted to the program that it will run. Unfortunately, this is something that the von Neumann processor architecture used in traditional CPUs cannot provide.

To overcome the limitations of today’s CPU architecture, the Mitrion Virtual Processor uses a novel processor architecture that resembles a cluster-on-a-chip.

Hardware Design                           | Software Development
Driven by Design Cycle                    | Driven by the Code-Base Life Cycle
Main Concern: Device Cost (Size, Speed)   | Main Concern: Development Cost and Maintenance
Precise Control of Electrical Signals     | Abstract Description of Algorithm

Table 1 – Hardware/software development comparison




Normal computer clusters comprise a number of compute nodes (often standard PCs) connected in a fixed network. The basic limitation of such a cluster is the network latency, ranging from many thousands up to millions of clock cycles, depending on the quality of the network. This means that each node has to run a large block of code to ensure that it has sufficient work to do while waiting for a response to a message. Lower latency in the network will enable each node to do less work between communications.

In the architecture of the Mitrion Virtual Processor, the entire cluster is on the same FPGA device. The network is fully adapted in accordance with the program requirements, creating an ad-hoc, disjunct network with simple point-to-point connections where possible, switched only where required. This allows the network to have a guaranteed latency of a single clock cycle.

The single-clock-cycle latency network is the key feature of the Mitrion Virtual Processor architecture. With a single-cycle latency network, it becomes possible for the nodes to communicate on every clock cycle. Thus, the nodes can run a block of code comprising only a single instruction. In a node that runs only one instruction, the instruction scheduling infrastructure of the node is no longer necessary, leaving only the arithmetic unit for that specific instruction. The node can also be fully adapted to the program.

The effect of this full adaptation of the cluster in accordance with the program is that the problem of instruction scheduling has been transformed into a problem of data-packet switching. For any dynamic part of the program, the network must dynamically switch the data to the correct node. But packet switching, being a network problem, is inherently parallelizable. This is in contrast to the inherently sequential problem of instruction scheduling in a von Neumann architecture.

This leaves us with a processor architecture that is parallel at the level of single instructions and fully adapted to the program.

Mitrion-C
To access all of the parallelism available from (and required by) the Mitrion Virtual Processor, you will need a fully parallel programming language. It is simply not sufficient to rely on vector parallel extensions or parallel instructions.

We designed the Mitrion-C programming language to make it easy for programmers to write parallel software that makes the best use of the Mitrion Virtual Processor. Mitrion-C has an easy-to-learn C-family syntax, but the focus is on describing data dependencies rather than order of execution.

The syntax of Mitrion-C is designed to help you achieve high performance in a parallel machine, just like ANSI C is designed to achieve high performance in a sequential machine. Thus the Mitrion-C compiler extracts all of the parallelism of the algorithm being developed. We should note, though, that Mitrion-C is purely a software programming language; no elements of hardware design exist.

The Mitrion Platform
Together with Mitrion-C and the Mitrion Virtual Processor, the Mitrion Software Development Kit (SDK) completes the Mitrion Platform (see Figure 1). The Mitrion SDK comprises:

• A Mitrion-C compiler for the Mitrion Virtual Processor

• A graphical simulator and debugger that allows you to test and evaluate Mitrion-C applications without having to run them in actual FPGA hardware

• A processor configuration unit that adapts a Mitrion Virtual Processor to the compiled Mitrion-C code

The Mitrion Platform is tightly coupled to Xilinx ISE™ Foundation™ software for synthesis and place and route, targeting Xilinx Virtex-II and Virtex-4 devices. It features a diagnostic utility that analyzes the output from these applications and can report any problems it encounters in a format that does not require you to have FPGA design skills.

Conclusion
The novel, massively parallel processor architecture of the Mitrion Virtual Processor to run software in an FPGA is a unique solution. It is a solution readily available on the market today that addresses the major obstacles to the widespread adoption of FPGAs as a means to accelerate supercomputing applications. With the Mitrion Virtual Processor, you will not need hardware design skills to make software run in FPGAs.

For more information about the Mitrion Virtual Processor and the Mitrion Platform, visit www.mitrionics.com or e-mail [email protected]

Figure 1 – The Mitrion Platform: a Mitrion-C source fragment flows through the compiler and processor configurator to produce a Mitrion Virtual Processor connected to on-chip RAM memories

Turbocharging Your CPU with an FPGA-Programmable Coprocessor
Coprocessor synthesis adds the optimal programmable resources.

by Barry O’Rourke Application Engineer [email protected]

Richard [email protected]

What do you do when the CPU in a board-level system cannot cope with the software upgrades that the next-generation product inevitably requires? You can increase the clock speed, but when – not if – that approach runs out of steam, you must design in a faster CPU, or worse, an additional CPU. Worse yet, you could add another CPU board if the end product’s form factor and cost target allow it.

In this article, we’ll discuss some of the pros and cons of these approaches and describe a programmable coprocessor solution that leverages an on-board Xilinx® FPGA to turbocharge the existing CPU and deliver the pros with none of the cons. You do not need processor design expertise, nor will you have to redevelop the software. Moreover, if the board already deploys an FPGA with sufficient spare capacity, the incremental silicon cost of the solution is zero.

The Brute Force Approach
Deploying faster or additional CPUs is an approach that is scalable as long as the software content growth remains relatively linear. However, software content growth is now exponential. A faster CPU might solve the problem temporarily, but it will soon be overwhelmed, necessitating a multiprocessor solution.

Many products cannot bear the redesign and additional silicon costs of a multiprocessor solution. Design teams do not often have the time to implement such a solution, and most teams certainly do not have the resources to re-architect the whole multiprocessor ensemble every time the software content exceeds the hardware’s capacity to process it. It is more than just a hardware development problem – it is a major software partition and development problem.

When a methodology breaks, you must re-examine the fundamentals. Why can’t a general-purpose (GP) CPU deliver more processing power? The answer is that most such CPUs offer a compromise between control functions and the parallel processing capability required by computationally intensive software. That capability is limited by a lack of both instruction-level parallelism and parallel processing resources.

When you deploy an additional general-purpose CPU to assist in the execution of, for example, a 20,000-line MPEG4 algorithm, you add only limited parallel processing capability, along with unnecessary control logic, all wrapped up in a package that consumes valuable board space. It is an expensive, brute force approach that requires the multiprocessor partitioning of software that has generally been developed for only one processor; software redevelopment where the additional CPU’s instruction set differs from that of the original CPU; the development of complex caching schemes; and the development of multiprocessor communication protocols – a problem exacerbated by the use of heterogeneous real-time operating systems.

However, even this may not deliver the requisite results. A major Japanese company recently published data that illustrates this point. The company partitioned software developed on one CPU for execution on four CPUs and achieved a maximum acceleration of only 2.83x. A less-than-optimal partition may have failed to utilize the additional parallel processing resources as efficiently as possible, but communications overhead surely played a role as well. In other words, maximizing CPU performance does not necessarily maximize system performance.

The Cascade Coprocessor Synthesis Solution
You can circumvent these problems by synthesizing an FPGA-implemented programmable coprocessor that, acting as an extension of the CPU, supplies the missing parallel processing capability. It increases both instruction-level parallelism and parallel processing resources. A mainstream engineer can design it in a matter of days without having any processor design expertise; CriticalBlue’s Cascade synthesizer automatically architects and implements the coprocessor.

Moreover, Cascade optimizes cache architecture and communications overhead to prevent performance from being “lost in traffic.” In other words, it boosts not only CPU performance but overall system performance.

The FPGA coprocessor accelerates software offloaded from the main CPU as-is, so no software partitioning and redevelopment are required, although you can also co-optimize software and hardware. Cascade thus supports both full legacy software reuse and new software deployment. The former is especially important if you have inherited the legacy software and do not necessarily know how it works.

Unlike fixed-function hardware, a Cascade FPGA coprocessor is not restricted to executing only one algorithm, or part thereof. Cascade can synthesize an FPGA coprocessor that executes two or more disparate algorithms, often with only marginally greater gate counts than a coprocessor optimized to execute only one.

A Cascade FPGA coprocessor is reprogrammable. Although it is optimized to boost the execution of particular algorithms, it will execute other algorithms, too.

Adding a Cascade FPGA coprocessor circumvents the problems of adding a CPU. As you add additional software to the system, it can run on an existing FPGA coprocessor or you can add a new one optimized for that software – all with software developed for single-processor operation. Adding multiple coprocessors can be more effective, less costly, and less time consuming than deploying multiple processors.

How Does It Work?
First, you should determine which software routines must be offloaded from the CPU. Cascade analyzes the profiling results of the application software running on the CPU to identify cycle-hungry candidates (Figure 1).

Cascade then automatically analyzes the instruction code – including both control and data dependencies – and maps the selected tasks onto multiple coprocessor architecture candidates that comply with user-defined clock rate and gate count specifications. For each candidate, the tool provides estimates of performance, gate count, and coprocessor/CPU communication overhead. You then select the optimum architecture.


Figure 1 – Cascade coprocessor synthesis flow


Cascade generates an instruction- and bit-accurate C model of the selected coprocessor architecture for use in performance analysis. Cascade uses the C model together with the CPU’s instruction set simulator (ISS) and the stimuli derived from the original software to profile performance, as well as analyze memory access activity and instruction execution traces. The analysis also identifies the instruction and cache “misses” that cause so many performance surprises at system integration. Because Cascade can generate coprocessor architectures so quickly, you can perform multiple “what if” analyses quite rapidly. The model is also used to validate the coprocessor within standard C or SystemC simulation environments.

In the next step, Cascade generates the coprocessor hardware in synthesizable RTL code, which you verify using the same stimuli and expected responses as those used by the CPU. The tool generates the coprocessor/CPU communications circuitry. It also simultaneously generates coprocessor microcode, modifying the original executable code so that function calls are automatically directed to a communications library. The library manages coprocessor handoff and communicates parameters and results between the CPU and the coprocessor. Cascade generates microcode independently of the coprocessor hardware, allowing new microcode to be targeted at an existing coprocessor design.
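The call-redirection idea can be pictured with a small conceptual C sketch. Everything named below (coproc_run, COPROC_FIR, fir_filter_args_t) is hypothetical and is not CriticalBlue's actual API; the sketch only shows the general pattern of a function body being replaced by a handoff to a communications library.

/* Conceptual sketch only: redirecting a function call to a coprocessor
 * through a communications library. All names are hypothetical. */
#include <stddef.h>

typedef struct {
    const short *samples;
    short       *out;
    size_t       len;
} fir_filter_args_t;

#define COPROC_FIR 1

/* Stand-in for the generated communications library: in a real system this
 * would pass the arguments to the coprocessor and wait for completion. */
static int coproc_run(int task_id, void *args)
{
    (void)task_id;
    (void)args;
    return -1; /* placeholder: no hardware present in this sketch */
}

/* The original entry point becomes a thin stub; the body that used to run
 * on the CPU now executes as microcode on the FPGA coprocessor. */
int fir_filter(const short *samples, short *out, size_t len)
{
    fir_filter_args_t args = { samples, out, len };
    return coproc_run(COPROC_FIR, &args);
}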

The Cascade tool provides interactive features if you wish to achieve even higher performance. For instance, it enables you to manually optimize code, co-optimize code and hardware, and deploy custom functional hardware units to achieve greater processing performance. You can explore these manual options in a “what-if” fashion to determine the optimal configuration.

What Are The Results?
The results for single and multiple algorithm execution are shown in Table 1. Executing a single real-time image processing algorithm as-is was 5x faster than on the CPU. Software code optimization and a custom functional unit increased that to 15x.

In the case of multiple algorithms, a coprocessor synthesized for SHA-1, a secure hash algorithm for cryptography applications, achieved a boost of 5x. A separate coprocessor with a custom functional unit synthesized for MD5, another hash algorithm, achieved 10x. A coprocessor optimized for both achieved 6.4x – using only 8% more gates than either of the two single coprocessors.

The power of reprogrammability is demonstrated in Table 2. One of our customers synthesized a coprocessor to execute an MP3 decoder algorithm, achieving a 2.13x boost despite using unoptimized code. The designer then generated a coprocessor to execute an MPEG1 Layer 2 encoder algorithm, also with no code optimization, achieving a 1.62x boost. When each algorithm was executed on the other’s coprocessor, they achieved boosts of 1.18x and 1.19x, respectively. Code optimization would have improved this performance further, and custom functional units to execute serial operations even more.

How Can You Implement It?
You can implement Cascade FPGA coprocessors in multiple Xilinx families: Virtex™-II, Virtex-II Pro/Pro X, Spartan™-3/3E/3L, and Virtex-4 FPGAs.

Cascade’s synthesizable RTL output works with Synopsys, Synplicity, and Xilinx synthesis tools and is fully integrated into the Xilinx ISE™ tool flow, while the verification approach works with the Xilinx ISE tool flow and Mentor Graphics’s ModelSim XE/PE.

Cascade can target any Xilinx-populated board without translation or modification – no family-specific design kit is required.

Conclusion
If you want to boost the processing power of your design without deploying new or improved microprocessors, and if you want to use software without repartitioning and redevelopment, contact CriticalBlue at [email protected]. We can tell you how to do it – without having to become a processor designer.


Algorithm                | Image Processing (as-is) | Image Processing (optimized) | SHA-1 | MD5 | SHA-1 + MD5
Lines of Original Code   | ~200                     | ~200                         | 150   | 700 | 850
Code Modified?           | No                       | Yes                          | No    | Yes | Same as MD5
Custom Functional Unit?  | No                       | Yes                          | No    | Yes | Same as MD5
Boost vs. CPU            | 5x                       | 15x                          | 5x    | 10x | 6.4x
Effort (in Days)         | 1                        | 3                            | 1     | 5   | 1 More Day

Table 1 – Single and multiple algorithm execution (single algorithm: real-time image processing, run as-is and with optimizations; multiple algorithms: SHA-1, MD5, and a combined SHA-1 + MD5 coprocessor)

Algorithm                                        | MP3 Decoder | MPEG1 Layer 2 Encoder
Lines of Original Code                           | ~9,700      | ~16,700
Code Modified?                                   | No          | No
Custom Functional Unit?                          | No          | No
Boost vs. CPU on its Own Dedicated Coprocessor   | 2.13x       | 1.62x
Boost vs. CPU on the Other’s Coprocessor         | 1.18x       | 1.19x
Effort (in Days)                                 | 2           | 1

Table 2 – Coprocessor reprogrammability with as-is code re-use

Tune Multicore Hardware for Software
Teja FP and Xilinx FPGAs give you more control over hardware configuration.

by Bryon Moyer, VP, Product Marketing, Teja Technologies, [email protected]

One of the key preferences of a typical embedded designer is stability in the hardware platforms on which they program; poorly defined hardware results in recoding hassles more often than not. But a completely firm, fixed hardware platform also comes with a set of constraints that programmers must live with. These constraints – whether simply design decisions or outright bugs – can force inelegant coding workarounds or rework that can be cumbersome and time-consuming to implement.

By combining an FPGA platform with a well-defined multicore methodology, you can implement high-performance packet processing applications in a manner that gives software engineers some control over the structure of the computing platform, potentially saving weeks or months of coding time and reducing the risk of delays.

Much of the hardware design process goes into defining a board. Elements such as the type of memory, bus protocol, and I/O are defined up front. If a fixed processor is used, then that is also defined early on. But for gigabit performance on packet processing algorithms, for example, a single processor typically won’t cut it, and multiple processors are necessary.

The best way to build a processing fabric depends on the software being run. By using an FPGA to host the processing, you can defer making specific decisions on the exact implementation until you know more about the needs of the code. The new Teja FP platform provides a methodology and multicore infrastructure on Xilinx® Virtex™-4 FPGAs, allowing you to decide on the exact configuration of the multicore fabric after the code has been written.

When Software Engineers Design Hardware
Hardware and software design are two fundamentally different beasts. No matter how much hardware design languages are made to look like software, it is still hardware design. It is the definition of structure, and processes ultimately get instantiated as structure. However, it is clear that software engineers are designing more and more system functionality using C programming skills; tools now make it possible for this functionality to target software or hardware.

The software approach is much more process-oriented. It is the “how to do it” more than the “what to build” because traditionally there has been nothing to build – the hardware was already built. A truly software-based design approach includes key functionality that is not built into the structure but is executed by the structure in a deployed system. The benefit of software-based functionality is flexibility: you can make changes quickly and easily up until and even after the system has shipped. FPGAs can also be changed in the field, but software design turns can be handled much more quickly than hardware builds.

Because hardware and software design are distinct, designers of one or the other will think differently. A hardware engineer will not become a software engineer just by changing the syntax of the language. Conversely, a software engineer will not become a hardware engineer simply by disguising the hardware design to resemble software. So allowing a software engineer to participate in the design of a processing fabric cannot be done blithely. In addition, turning a hardware-oriented design over to a software engineer is not likely to be met with cheers by hardware designers, software designers, or project managers. Hardware decisions made by software engineers must be possible using a methodology that resonates with a software engineer in a language familiar to the software engineer.

The key topology of the processing fabric for a multicore packet processing engine is the parallel pipeline, shown in Figure 1. Such an engine comprises an array of processors, along with possible hardware accelerators. Solving the problem means asking the following questions:

• How many processors do I need?

• How should they be arranged?

• How much code and local data store does each processor require?

• What parts of the code need hardware acceleration?

Let’s take these questions case by case and assemble a methodology that works for software engineers.

Processor Count and Configuration
The number of processors required is more or less a simple arithmetic calculation based on the cycle budget and the number of cycles required to execute code. The cycle budget is a key parameter for applications when you have a limited amount of time to do the work. With packet processing, for example, the packets keep coming, so you only have so many cycles to finish your work before the next packet arrives. If your code takes longer than that, you need to add more processors. For example, if the cycle budget is 100 cycles and the code requires 520 cycles, then you need 6 processors (520 divided by 100 and rounded up). You can fine-tune this, but it works for budgetary purposes. The cycle count is determined by profiling, which you can perform using Teja’s tools.
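The arithmetic is easy to capture in a few lines of C; the helper below is just an illustration of the ceiling division described above and is not part of the Teja FP APIs.

/* Illustrative helper only (not part of the Teja FP APIs): round the
 * cycles-per-packet figure up against the per-packet cycle budget. */
unsigned processors_needed(unsigned cycles_required, unsigned cycle_budget)
{
    return (cycles_required + cycle_budget - 1) / cycle_budget;
}

/* Example from the text: a 520-cycle workload against a 100-cycle budget
 * gives processors_needed(520, 100) == 6 processing engines. */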

The next question is how to arrange the processors. The simplest way to handle this is to decide if you want to partition your code to create a pipeline. A pipeline uses fewer resources but adds latency. If you want to partition your code, then pick an obvious place – something natural that works intuitively. There’s no need to figure out where the exact cycle midway point is.

Let’s say that with six processors you will have a two-stage pipeline. Now you need to figure out how many processors go in each stage; you can do this by profiling the cycle counts of each of the partitions and dividing again by the cycle budget. So if the first partition took 380 cycles, it will require 4 processors; that leaves 140 cycles for the second stage, which will require 2 processors. (The cycle counts of the two partitions won’t actually add neatly to the cycle count of the unpartitioned program, but it is close enough for our purposes.) So this two-stage pipeline will have four processors in the first stage and two in the second stage. By using Xilinx MicroBlaze™ soft cores, you can instantiate any such pipeline given sufficient logic resources.

By contrast, a fixed pipeline structure would have a pre-determined number of processors in each stage. This would mandate a specific partition, which can take a fair bit of time to implement. Instead, the pipeline is made to order and can be irregular, with a different number of processors in each stage. So it doesn’t really matter where the partition occurs. The hardware will be designed around the partitioning decision instead of vice versa.

The question of how a software engineer can implement this is key. Teja has assembled a set of APIs and a processing tool that allows the definition of the hardware platform in ANSI C; the tool executes that program to create the processing platform definition in a manner that the Xilinx embedded tools can process. This set of APIs is very rich and can be manipulated at a very low hardware level, but most software engineers will not want to do this. Thus, Teja has put together a “typical” pipeline definition in a parameterized fashion. All that is required to implement the pipeline in the example is to modify the two simple #define statements in a configuration header file.

Figure 1 – A parallel pipeline where each engine comprises a MicroBlaze processor, private block RAM memory, and optional offloads




The following statements define a two-stage pipeline, with four engines in the first stage and two in the second stage:

#define PIPELINE_LENGTH 2

#define PIPELINE_CONFIG {4,2};

From this and the predefined configuration program, the TejaCC program can build the pipeline. Of course, if for whatever reason the pipeline configuration changes along the way, similar edits can easily take care of those changes.
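For example, if a later profile suggested a three-stage split, the header edit might look like the following (the stage counts here are hypothetical and are not taken from the article's example):

/* Hypothetical re-partition: three pipeline stages with three, two, and one
 * processing engines respectively. Only the two #defines change; TejaCC
 * rebuilds the platform from the same configuration program. */
#define PIPELINE_LENGTH 3
#define PIPELINE_CONFIG {3,2,1};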

Memory
The third question is about the amount of memory required. In a typical system, you are allocated a fixed amount of code and data store. If you miss that target, a significant amount of work can go into squeezing things in. With FPGAs, however, as long as the amount of memory needed is within the scope of what’s available in the chip, it is possible to allocate a tailored memory size to each processor. More typically, all would have the same size (since there is a 2K minimum allocation block, limiting the amount of fine tuning possible). Code compilation provides the size required, and the configuration header file can be further edited with the following statements, which in this example allocate 8 KB of memory for both code and data store:

#define CPE_CODE_MEM_SIZE_KB 8

#define CPE_DATA_MEM_SIZE_KB 8

Offloads for Acceleration
The fourth question deals with the creation of hardware accelerators. There may be parts of the code that take too many cycles. Those cycles translate into more processors, and a hardware accelerator can eliminate some of those processors. As long as the hardware accelerator uses fewer gates than the processors it is replacing, this will reduce the size of the overall implementation.

Teja has a utility for creating such accelerators, or offloads, directly from code. By annotating the program, the utility will create:

• Logic that implements the code

• A system interface that allows the accelerator to plug into the processor infrastructure

• An invocation prototype that replaces the original code in the program

• A test bench for validating the offload before integrating it into the system

Once an offload is created, the cycle count is reduced, so you will have to re-evaluate the processor arrangement. But because it is so easy to redefine the pipeline structure, this is a very straightforward task.

A Straightforward Methodology
Putting all of this together yields the process shown in Figure 2. You can define offloads up-front (for obvious tasks) or after the pipeline has been configured (if the existing code uses too many processors).

The controls in the hands of the software engineer are parameters that are natural or straightforward for a software engineer and expressed in a natural software language – ANSI C. All of the details of instantiating hardware are handled by the TejaCC program, which generates a project for the Xilinx Embedded Development Kit (EDK). EDK takes care of the rest of the work of compiling/synthesis/placing/routing and generating bitstreams and code images.

In this manner, hardware engineers can design the boards, but by using FPGAs they can leave critical portions of the hardware configuration to the software designer for final implementation. This methodology also lends itself well to handling last-minute board changes (for example, if the memory type or arrangement changes for performance reasons). Because the Teja tool creates the FPGA hardware definition, including memory controllers and other peripherals, board changes can be easily accommodated. The net result is that the hardware target can be adapted to the software so that the software designer doesn’t have to spend lots of time coding around a fixed hardware configuration.

The Teja FP environment and infrastructure make all of this possible by taking advantage of the flexible Virtex-4 FPGA and the MicroBlaze core. Armed with this, you can shave weeks and even months off of the development cycle.

Teja also provides high-level applications; these applications can act as starting points for launching a project and reducing the amount of work required. By combining a time-saving flexible methodology with pre-defined applications, creators of networking equipment can complete their designs much more quickly.

For more information on Teja FP, contact [email protected].

Figure 2 – Process for configuring a parallel pipeline: determine the cycle budget, identify offloads (optional), profile the code, calculate the number of processors, partition the code, identify offloads (optional), profile the partitioned code, and configure the pipeline

Automatically Identifying and Creating Accelerators Directly from C Code
The Mimosys Clarity tool exploits the power of the Virtex-4 FX PowerPC APU interface.

by Jason Brown, Ph.D. CEO [email protected]

Marc Epalza, Ph.D., Senior Hardware [email protected]

Let’s say that you have been tasked to ensure that your company has an H.264 solution that supports high-definition video decoding at 30 frames per second. You are not a video expert. What do you do? You could get on the Internet and perform a Web search for H.264; before you know it, you’ll have the source code and be on your way.

You managed to compile the code and get it running on the target, but it decodes at a whopping two frames per second. Now what? After sifting through pages and pages of profiling data, you find some hotspots, but you are not sure which parts to focus on to maximize the acceleration and you do not have enough time to try to optimize them all.

Many of us have found ourselves in this situation at one time or another. Maybe you have even delivered a solution, but not without a lot of sweat and tears.

Mimosys Clarity
The Mimosys Clarity software tool automatically identifies candidate hardware accelerators directly from application C source code, guided by the execution profile of the application. It also takes a set of constraints on the number of I/O parameters available to the accelerator and a model of the execution costs for operations on the PowerPC™ and in Xilinx® FPGA fabric.




Clarity’s approach to profiling is unique in that the execution profiling information is visually presented in the tool as annotations to basic blocks of control flow graphs (CFGs), as shown in Figure 1. The nodes of a control flow graph represent the basic blocks of the C source code, where a basic block is any sequence of C expressions without conditions. The edges of a CFG represent an execution path between basic blocks, where a specific path is followed (or taken) if a particular condition is true or false.

Because each basic block is a sequence of expressions, it can be represented by a data flow graph (DFG), the nodes of which are operations (+, -, *, /) and the edges values (Figure 2). In the vast majority of cases, this information is not automatically created or visualized. Furthermore, this visualization, if done at all, is typically static and thus unable to retain one of the most important aspects: the correspondence between the graphs and the source code from which they came.
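A small C fragment makes the distinction concrete; the code below is only an illustration and is not taken from the profiled application.

/* Illustrative fragment only. The if/else splits the function into several
 * basic blocks in the CFG: the straight-line code before the branch, the
 * two branch arms, and the join that returns the result. Each basic block
 * is a straight-line sequence of expressions, so each maps to its own DFG
 * whose nodes are the +, *, - operations and whose edges are the values
 * flowing between them. */
int saturate(int a, int b, int limit)
{
    int sum = a + b;          /* straight-line block, then the branch      */
    if (sum > limit)
        sum = limit;          /* arm taken when the condition is true      */
    else
        sum = sum * 2 - 1;    /* arm taken otherwise                       */
    return sum;               /* join block                                */
}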

This unique approach of visually representing profiling information in CFGs and DFGs, along with the aforementioned correspondence, helps you quickly hone in on and understand the most computationally intensive parts of the application, as all of the views are always synchronized.

Once you have gathered the application profiling information, you can invoke the automatic accelerator identification search algorithm. This algorithm identifies a set of optimal hardware accelerators that provide the best execution speedup of the application given the constraints.

An important constraint on accelerators is the number of available data inputs and the number of data outputs, which in the current Virtex™-4 FX device is fixed by the PowerPC to two inputs and one output. Figure 3 shows the results from two invocations of the accelerator search algorithm. The upper table uses a constraint of two inputs and one output, with the lower using a constraint of four inputs and one output for the application ADPCM (adaptive differential pulse code modulation).

Clearly, the number of I/Os has a significant impact on the acceleration achieved. In order to realize the higher acceleration while overcoming the hard constraints imposed by the PowerPC, you can use pipelining techniques at the cost of increasing design complexity. Clarity automatically pipelines accelerators, giving access to the higher performance with no extra work for you.
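As a hand-worked illustration of what such splitting amounts to (this is not Clarity's generated output), a four-input sum that exceeds the two-input/one-output limit can be decomposed into a chain of two-input operations, each of which fits the APU operand constraint:

/* Illustration only: a 4-input accelerator candidate decomposed into
 * 2-input/1-output steps. acc_add2() stands in for a single two-input
 * accelerator instruction executing in the FPGA fabric. */
static inline int acc_add2(int a, int b)
{
    return a + b;
}

int sum4(int a, int b, int c, int d)
{
    int ab = acc_add2(a, b);      /* first 2-in/1-out step                */
    int cd = acc_add2(c, d);      /* second 2-in/1-out step               */
    return acc_add2(ab, cd);      /* final step combines the partial sums */
}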

The identified hardware accelerators are optimal; no DFG of the application considered by Clarity contains a sub-graph with better estimated performance. As a consequence, if the set of accelerators discovered provides little or no benefit, it indicates that you need to rework the application source code to expose improved acceleration opportunities. Furthermore, the automatic identification algorithm takes both local (DFG) and global (execution profile) information into account, thus providing unique insight into where to focus your efforts in optimizing an application.

In the iterative process of optimizing an application, the identification algorithm will discover a set of accelerators that will satisfy your requirements. However, the task remains to realize them on the Virtex-4 FX platform. You must perform two important steps to achieve this: first, create an HDL definition of the accelerators that includes the necessary logic to interface the PowerPC APU port with the accelerator itself. You must assign the new accelerators unique identifiers (instructions) so that the PowerPC can be instructed to activate the accelerator by the software application.

The second step is to modify the original application source code to use the new instructions, thus achieving the execution speedup offered by the new hardware accelerators. In a normal design flow both of these steps are manual, although in some ESL design flows you can describe the hardware accelerators using C/C++/SystemC or a derivative programming language. But the software integration step is always manual.

Realizing the Accelerators
Clarity automates the entire process of creating accelerators for Virtex-4 FX FPGAs, including software integration. This is achieved when you commit the newly discovered accelerators to the Xilinx platform.


Figure 3 – Instructions found with a constraint of 2-1 (top) and 4-1 (bottom), with corresponding performance increases.

Figure 2 – Data flow graph (DFG); each node represents a single operation.

Figure 1 – Control flow graph (CFG); each node represents a basic block (DFG).


The commit stage is fully automatic and comprises three parts:

• Modifying the original application source code to use the newly generated accelerators. The source code of the application is modified so that the parts to be implemented in the FPGA are removed and replaced by calls corresponding to the new hardware accelerators. Figure 4 shows a snippet of code from the ADPCM application, in which the removed source code is commented out and replaced with a call to the macro Instr_1 (see the sketch after this list). This macro contains the special PowerPC instruction used to activate the hardware accelerator corresponding to the Instr_1 that Clarity discovered and you selected.

• Generating RTL for each accelerator along with the necessary logic to interface the Xilinx PowerPC APU and the accelerator data path. Each accelerator is implemented in VHDL as an execution data path along with the required PowerPC interfacing logic. The data path is translated directly from the C source code and is thus correct by construction. This translation is possible because the data path implements a pure data flow with limited control, avoiding the issues of high-level synthesis. As described above, if the I/O constraints specified for the search exceed those of the PowerPC architecture, the data path will automatically be split into stages and pipelined to fit these constraints.

• Creating a Xilinx Platform Studio (XPS) project containing the RTL implementations and modified application source code. To provide confidence in the correctness and synthesized result of the HDL accelerators, you can have a test bench created automatically for each accelerator along with the necessary simulation and synthesis scripts. You can invoke an HDL simulator such as ModelSim from within Clarity to compare the functionality of the original C code with the HDL code of the accelerator replacing it. The generated synthesis script enables you to perform synthesis on each accelerator in the target technology and obtain information on the critical path and area. Furthermore, this verification test bench provides a framework to ensure the correctness of identified accelerators that may be subsequently modified by a design engineer.
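A schematic view of the source rewrite described in the first bullet might look like the fragment below. Only the macro name Instr_1 comes from the article; the variable names, the commented-out expression, and the macro body are placeholders, since the actual ADPCM listing appears only in Figure 4.

/* Sketch of the commit-stage source rewrite; placeholder code, not the
 * actual ADPCM listing from Figure 4. In the generated project, Instr_1
 * would expand to the PowerPC APU instruction that drives the accelerator;
 * here a software stand-in keeps the sketch self-contained. */
#define Instr_1(a, b) ((a) + ((b) >> 1))

int adpcm_step(int vpdiff, int step)
{
    /* Original CPU code, now commented out by the commit stage:
     *     vpdiff += step >> 1;
     */
    vpdiff = Instr_1(vpdiff, step);   /* call into the hardware accelerator */
    return vpdiff;
}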

Hardware Design for Software Engineers
The challenge of identifying hardware accelerators for an application is formidable, especially if expert domain knowledge is required but unavailable. This challenge is made more acute by the difficulties in realizing these accelerators as coprocessors, requiring new interfaces and software integration to manipulate them. By providing a standardized interface, the Virtex-4 FX PowerPC enables new automation technologies that address both identification and realization of accelerators. Furthermore, Clarity can automatically circumvent the apparent I/O limitations of the PowerPC APU.

In particular, Clarity offers a unique solution to this challenge through an innovative combination of automation and visualization techniques (see Figure 5). Working at the level of the C programming language, you can visualize all aspects of the control flow, data flow, and execution behavior of an application. A unique hardware accelerator identification technology automatically discovers and creates application-specific hardware accelerators targeted to the Xilinx Virtex-4 FX device. Fully automatic HDL generation, application software integration, and test bench generation mean that you are freed from any concerns about how to realize application acceleration in hardware, thus empowering you to focus on your product differentiation.

So finding the H.264 source code on the Web was not such a bad idea. You created some useful accelerators and implemented them on the Virtex-4 FX device before lunch time, leaving the afternoon free to explore some different solutions just for fun.

For more information about Clarity, please e-mail [email protected] or visit www.mimosys.com/xilinx.

Figure 5 – Mimosys Clarity flow; from the application source code and some constraints, you can generate a complete Xilinx Platform Studio project with accelerators

Figure 4 – Modified application; the code moved into the hardware accelerator has been commented out and replaced with a call to the accelerator. The code is from ADPCM.


ExploreAhead Extends the PlanAhead Performance Advantage
By leveraging computing resources, you can achieve improved design results.

by Mark Goosman, Product Marketing Manager, Xilinx, Inc., [email protected]

Robert Shortt, Senior Staff Software Engineer, Xilinx, Inc., [email protected]

David Knol, Senior Staff Software Engineer, Xilinx, Inc., [email protected]

Brian Jackson, Product Marketing Manager, Xilinx, Inc., [email protected]

FPGA design sizes and complexity are now in the stratosphere. An increasing number of designers are struggling to meet their design goals in a reasonable amount of time.

Xilinx introduced PlanAhead™ software as a means to combat lengthening design closure times. The PlanAhead hierarchical design and analysis tool helps you quickly comprehend, modify, and improve your designs. Earlier PlanAhead versions (7.1 and earlier) were used to improve performance of the design through floorplanning. The software does encapsulate Xilinx® ISE™ back-end tools to complete the EDIF/NGC-to-bitstream flow. However, earlier versions left the complex job of design closure through place and route options to the user.




PlanAhead 8.1 design tools introduce ExploreAhead, which simplifies the tedious task of wading through myriad place and route options to obtain the best achievable QoR for a given floorplan. ExploreAhead also enables an effective use of multi-core CPU machines to speed up the design convergence process.

ExploreAhead is an implementation exploration tool. It manages multiple implementation runs of your design through the NGDBuild, map, and place and route steps. ExploreAhead allows you to create, save, and share place and route options as “recipes” or “strategies.” With ExploreAhead, you can order multiple implementation runs based on the strategies you defined or on the predefined strategies shipped as factory defaults. These runs can be parallelized to take advantage of multi-core CPU machines.

ExploreAhead manages reports and statistics on the runs, allowing you to pick the best implementation for your design. Figure 1 shows an illustration of ExploreAhead.

Strategy
A “strategy” is defined as a set of place and route options. This is a recipe that you can use to implement a single place and route run. Strategies are defined by ISE release and encapsulate command-line options for each of the implementation tools: NGDBuild, map, and place and route.

Using strategies is a very powerful concept that makes ad hoc management of your place and route options a seamless task. ExploreAhead ships with a set of predefined strategies. Xilinx has tested these predefined strategies and found them to be some of the most effective techniques to get better performance on designs. These factory-default strategies are prioritized by their effectiveness. Predefined strategies eliminate the need for you to learn new options each time a new version of ISE software is released to achieve the best QoR.

ExploreAhead also introduces an easy-to-use strategies editor for you to create your own favorite strategy. Figure 2 shows the strategies editor.

Expert users are encouraged to craft strategies suitable for their designs. User-defined strategies are stored under $HOME/.hdi/strategies for Unix users and C:\documents and settings\$HOME\application data\hdi\strategies for Microsoft Windows users. These are simply XML files for teams of users to share. Design groups wanting to create group-wide custom strategies accessible to anyone using PlanAhead software can copy user-defined strategies to the <InstallDir>/strategies directory.

Run
ExploreAhead introduces the concept of a “run” object. Once launched, the run will work through the back-end tools to implement the design. Each run is associated with a set of place and route options or a defined strategy. ExploreAhead gives you the capability to launch multiple implementation runs simultaneously.

Launching an ExploreAhead run is a two-step process. The first step involves queuing up the run with different strategies. The second step will actually launch the place and route tools on each of the runs. The two dialog boxes in Figure 3 show the two steps.

Once you have interacted with the “Run ExploreAhead” dialog box and generated the required set of runs, a summary table of runs appears in the PlanAhead console window. Figure 1 displays one such table of runs. Each of the runs is selectable. Selecting a run will display the properties of this run in the PlanAhead properties window. Selecting one or many runs and hitting the launch button will bring up the launch dialog box. Here ExploreAhead will allow you to start multiple place and route runs simultaneously on a multi-core CPU machine. ExploreAhead will push all of the requested place and route runs into a queue and will then launch runs from this queue when CPU resources become available.



Figure 1 – Multiple implementation runs with varying strategies

Figure 2 – Strategies editor

Figure 3 – Run ExploreAhead and Launch Runs dialog boxes



Monitor
ExploreAhead has an easy-to-use interface to monitor ongoing runs. The “summary runs” table in the console window helps you quickly browse through the relevant characteristics of a run: the CPU time implications of a strategy, percent completion, timing results, and description of the strategy employed. In addition to the summary results table, there is also a live “run monitor” that displays all console messages generated by place and route tools. You can simply select a run and tab over to the monitor tab of the property dialog box to engage the console messages. Figure 4 shows the properties dialog box for a single run.

Reports
The reports tab in the “Run Properties” dialog box shown in Figure 4 lists all of the potential reports that could be produced during an implementation run. These reports are grayed out at the start of the run and become available for browsing as the run proceeds through the various back-end steps. Simply double-click on each of the available reports to add it to the desktop browser.

Results
Once the ExploreAhead runs are complete and you have all of the reports for all of their runs at your disposal, you can then decide to import the results into PlanAhead software for further investigation. After selecting a run, you can then right-click to import the run, which will also allow you to import placement and timing results into PlanAhead design tools. Figure 5 shows the import run dialog box.

Project
PlanAhead 8.1 software introduces the PlanAhead project repository. The PlanAhead project will save your ExploreAhead run information. ExploreAhead acknowledges that some of the place and route runs can take a significant amount of run time. If you launch a large number of runs, this can also add to your total completion time. As such, PlanAhead software allows you to exit the program, allowing the place and route runs to continue on your machine. You can then launch PlanAhead design tools at a later time; it will re-open the project, re-engage the monitors, and open the summary run tables and the report files. This powerful feature allows you to free up a PlanAhead license during place and route runs.

ExploreAhead Design Flow
You can employ floorplanning – a key enabling methodology – within the PlanAhead framework in conjunction with ExploreAhead to get the best-in-class QoR. You can use ExploreAhead on a floorplanned design or on a single Pblock. You can then piece the design together using a combination of floorplanning, a block-based implementation approach, and incremental design techniques. ExploreAhead, however, makes no assumptions as to the need to floorplan a design. The basic ExploreAhead design flow, shown in Figure 6, requires no floorplanning.

Conclusion
ExploreAhead expands the PlanAhead portfolio to include QoR optimization in the implementation tools. ExploreAhead brings together several key technologies to make PlanAhead design tools an extremely productive environment for interactive design exploration and closure. The combination of multiple what-if floorplans and what-if ExploreAhead runs on each of these floorplans expands the search space of possibilities enormously.

ExploreAhead enables you to search through this large space effectively to pick the optimal implementation solution for your design.


Figure 4 – Selected run properties: general, place and route options, live monitor, reports

Figure 5 – Import Run dialog box

Figure 6 – Basic PlanAhead/ExploreAhead flow diagram: create a new project (import EDIF, select device, import constraints), run ExploreAhead (create runs, tweak strategies and options), launch runs (parallelize runs), and import results


Designing Flexible, High-Performance Embedded Systems
Soft processing and customizable IP enable the best of both worlds.

by Jay Gould, Product Marketing Manager, Xilinx, Inc., [email protected]

Which do you want in your next embedded project: flexible system elements so that you can easily customize your specific design, or extra performance headroom in case you need more horsepower in the development cycle? Why put yourself under undue development pressure and settle for one or the other? Soft processing and customizable IP offer the best of both worlds, integrating the concepts of custom design and co-processing performance acceleration.

Discrete processors offer a fixed selection of peripherals and some kind of performance ceiling capped by clocking frequency. Embedded FPGAs offer platforms upon which you can create a system with a myriad of customizable processor cores, flexible peripherals, and even co-processing offload engines. You now have the power to design an uncompromised custom processing system to satisfy the most aggressive project requirements and punch a hole in that performance ceiling, while maximizing system performance by implementing accelerated software instructions in the FPGA hardware. With FPGA fabric acceleration, the sky’s the limit.

Flexibility
In addition to the high-performance PowerPC™ hard-processing core available in Xilinx® Virtex™ Platform FPGAs and the small-footprint PicoBlaze™ microcontroller core programmed with assembly language, Xilinx offers you the choice of designing with a customizable general-purpose 32-bit RISC processor. The MicroBlaze™ soft processor is highly flexible because it can be built out of the logic gates of any of the Virtex or Spartan™ families, and you can customize the processing IP peripherals to meet your exact requirements.

With customizable cores and IP, youcreate just the system elements you needwithout wasting silicon resources. Whenyou build a processing system in a pro-grammable device like an FPGA, you willnot waste unused resources in a discretedevice, nor will you run out of limitedperipherals if you require more than whatare offered (say your design requires threeUARTs and your discrete device offersonly one or two). Additionally, you are nottrapped by your initial architectureassumptions; instead, you can continue todramatically modify and tune your system



One FIR filter design example from the workshop materials for the 2006 Embedded Systems Conference had a MicroBlaze system configured with an optional internal IEEE 754-compliant floating-point unit (FPU), which facilitates a significant performance increase over software-only execution on the processor core. By including optional MicroBlaze components, you can quickly improve the performance of your application.

A side benefit of these optional internal components is that they are fully supported by the MicroBlaze C compiler, so source-code changes are unnecessary. In the FIR filter design example, including the FPU and recompiling the design meant immediate performance improvements, as calls to external C library floating-point functions are automatically replaced with instructions to use the new FPU.
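As a minimal illustration (not the actual workshop source), the kind of single-precision FIR inner loop involved might look like the sketch below. When the optional FPU is present and the project is simply rebuilt, the MicroBlaze compiler emits hardware floating-point instructions for the multiplies and adds instead of calls into the software floating-point library, so the C source itself is untouched.

/* Hedged sketch of a single-precision FIR kernel, not the ESC workshop
 * code. Rebuilding it for a MicroBlaze configured with the optional FPU
 * replaces soft-float library calls with FPU instructions; the source
 * does not change. */
float fir_sample(const float *x, const float *h, int ntaps)
{
    float acc = 0.0f;
    for (int i = 0; i < ntaps; i++)
        acc += x[i] * h[i];   /* float multiply-add: uses the FPU when present */
    return acc;
}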

Utilizing specialized hardware-processing components improves processor performance by reducing the number of cycles required to complete certain tasks by orders of magnitude over software recoding methods. The simple block diagram in Figure 1 represents a MicroBlaze processing system with an internal FPU IP core, local memory, and a choice of other IP peripherals such as a UART or a JTAG debugging port. Because the system is customizable, we could very well have implemented multiple UARTs or other IP peripheral cores from the Xilinx processor IP catalog, including a DMA controller, IIC, CAN, or DDR memory interface.

The IP catalog provides a wide variety of other processing IP (bridges, arbiters, interrupt controllers, GPIO, timers, and memory controllers) as well as customization options for each IP core (baud rates and parity bits, for example) to optimize elements for feature, performance, and size/cost. Additionally, you can configure the processing cores with respect to clock frequency, debugging modules, local memory size, cache, and other options. By merely turning on the FPU core option, here at Xilinx we built a MicroBlaze system that optimized the aforementioned FIR implementation from 8.5 million CPU cycles to only 177,000 CPU cycles, enabling a performance improvement of 48x with no changes to the C source file.

In a second example, we’ll build on an additional design module, implementing an IDCT engine for an MP3 decoder application that will accelerate the application module by more than an order of magnitude.

You can easily create both processor platform examples referenced here with a development kit like the one depicted in Figure 2. The Integrated Hardware/Software Development Kit includes a Virtex-4 reference board that directly supports both PowerPC and MicroBlaze processor designs. The kit also includes all of the compiler and FPGA design tools required, as well as an IP catalog and pre-verified reference designs.

With the addition of a JTAG probe and system cables, the kit allows you to have a working system up and running right out of the box before you start editing and debugging your own design changes. Development kits for various devices and boards are available from Xilinx, our distributors, and third-party embedded partners.

Locate Bottlenecks and Implement Co-Processing
The MicroBlaze processor, one of EDN’s Hot 100 Products of 2005, utilizes the IEC (International Engineering Consortium)


Figure 1 – Simple MicroBlaze block diagram (local memory, UART, debug module, FPU, and IDCT engine)

Figure 2 – Integrated Hardware/Software Development Kit


award-winning Xilinx Platform Studio (XPS) embedded tool suite for implementing the hardware/IP configuration and software development. XPS is included in our pre-configured Embedded Development Kits and is the integrated development environment (IDE) used for creating the system. If you have a common reference board or create your own board description file, then XPS can drive a design wizard to quickly configure your initial system.

Reduce errors and learning curves with intelligent tools so that you can focus your design time on adding value in the end application. After creating the basic configuration, you can spend your time iterating on the IP to customize your specific system and then develop your software applications.

XPS provides a powerful software development IDE based on the Eclipse framework for power coders. This environment is ideal for developing, debugging, and profiling code to identify the performance bottlenecks that hide in otherwise invisible code execution. These inefficiencies in the code are often what makes a design miss its performance requirement goals, but they are hard to detect and often even harder to optimize.

Using techniques like in-lining code to reduce the overhead of excessive function calls, you can improve application performance by 1%~5%. But with programmable platforms, more powerful design techniques now exist that can yield performance improvements of an order of magnitude or two.
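A trivial, purely illustrative example of the in-lining technique mentioned above: marking a small, frequently called helper for inline expansion removes the per-call overhead inside a tight loop.

/* Illustrative only: inline expansion removes call/return overhead for a
 * small helper invoked once per sample in a tight loop. */
static inline int scale_q8(int sample, int gain)
{
    return (sample * gain) >> 8;         /* fixed-point scale with a Q8 gain */
}

void scale_block(int *buf, int n, int gain)
{
    for (int i = 0; i < n; i++)
        buf[i] = scale_q8(buf[i], gain); /* expanded in place by the compiler */
}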

Figure 3 shows a montage of Platform Studio views for performance analysis. XPS displays profiling information in a variety of forms so that you can identify trends or individual offending routines that spike on performance charts. Bar graphs, pie charts, and metric tables make it easy to locate and identify function and program inefficiencies so that you can take action to improve those routines that leverage the most benefit for total system performance.

Soft-Processor Cores with Their Own IP Blocks
For the MP3 decode example that I described earlier, we built a custom system (Figure 4), starting with the instantiation of multiple MicroBlaze processors. Because the MicroBlaze processor is a soft-core processor, we can easily build a system with more than one processor and balance the performance loading to yield an optimal system.

In Figure 4 you can clearly see the top MicroBlaze block with its own bus and peripheral set separate from the bottom MicroBlaze block and its own, different peripheral set. The top section of the design runs embedded Linux as an OS with full file system support, enabling access to MP3 bitstreams from a network. We offloaded the decoding and playing of these bitstreams to a second MicroBlaze processor design, where we added tightly coupled processor offload engines for the DCT/IMDCT (forward and inverse modified discrete cosine transform) functions and two high-precision MAC units.

The IMDCT block covers data compression and decompression to reduce transmission-line execution time. DCT/IMDCT are two of the most computationally intense functions in compression applications, so moving this whole function to its own co-processing block greatly improves overall system performance.


Figure 4 – MicroBlaze MP3 decoder example: two MicroBlaze processors inside the FPGA, each with local BRAM over LMB; OPB peripherals (MDM, timer, interrupt controller, UART, Ethernet MAC/PHY, SystemACE, ZBT memory controllers, AC97 controller and external AC97 codec); and FSL-connected 18x36 IMDCT, 32x32 DCT, and LL_SH MAC1/MAC2 offload engines

Figure 3 – Platform Studio embedded tool suite


Whereas we implemented an internal FPU in the earlier FIR filter example, the MP3 example has implemented MicroBlaze customizations and added external dedicated hardware within the FPGA.

Co-Processing + Customizable IP = Performance
By offloading computationally intense software functions to co-processing “hard instructions,” you can develop an optimal balance for maximizing system performance. Figure 4 also shows a number of IP peripherals for the Linux file system module, including UART, Ethernet MAC, and many other memory controller options. The coder/decoder application block, by comparison, uses different IP customizable for different system capabilities.

The second MicroBlaze soft core is slaved to the first MicroBlaze processor and acts as a task engine decoding the MP3 bitstream. The decoder algorithm, with the addition of specific IP cores, is connected directly to the FPGA fabric hardware resources through a Xilinx Fast Simplex Link (FSL) connection interface. This co-processing design technique takes advantage of the parallel and high-speed nature of FPGA hardware compared to the slower, sequential execution of instructions on a stand-alone processor.
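For reference, driving an FSL-attached engine from MicroBlaze C code is typically just a matter of streaming words to and from the link. The sketch below assumes the Xilinx putfsl/getfsl macros from mb_interface.h and an accelerator on FSL channel 0; it is illustrative only, not the actual decoder source.

/* Hedged sketch: stream one block of operands to an FSL-attached
 * co-processing engine and read back the results. Assumes the Xilinx
 * putfsl/getfsl macros (mb_interface.h) and an engine on FSL channel 0;
 * names and sizes are illustrative, not taken from the MP3 design. */
#include <mb_interface.h>

void fsl_offload_block(const int *in, int *out, int n)
{
    for (int i = 0; i < n; i++)
        putfsl(in[i], 0);    /* blocking write of operands into FSL channel 0 */
    for (int i = 0; i < n; i++)
        getfsl(out[i], 0);   /* blocking read of the transformed results */
}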

Direct linkage to the high-performance FPGA fabric introduces fast multiply-accumulate modules (LL_SH MAC1 and LL_SH MAC2 in Figure 4) to complement dedicated IP for the DCT and IMDCT blocks. The long-long MAC modules provide higher precision while offloading the processing unit. You will also notice high-speed FSL links utilized for the AC97 controller core to interface to an external AC97 codec, allowing the implementation of CD-quality audio input/output for the MP3 player.

The co-processing system depicted in Figure 4 results in a cumulative 41x performance acceleration over the original software application with the additive series of component boosts. Comparing a pure “software-only” implementation (see the top horizontal bar illustrated in Figure 5) to each subsequent stage of the hardware instruction instantiations, you can see how the performance improvements add up. Moving software into the IMDCT alone yields a 1.5x improvement, while adding the DCT as a hardware instruction moves it up to 1.7x. Another larger improvement is realized by implementing a long-long multiply accumulate to reach 8.2x.
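For the record, the stage factors quoted above follow directly from the Figure 5 run times: dividing the 290-second software-only time by each accelerated time gives roughly the same numbers (the quoted 8.2x reflects rounding of the bar-chart times to whole seconds).

\frac{290\ \mathrm{s}}{190\ \mathrm{s}} \approx 1.5\times, \qquad
\frac{290\ \mathrm{s}}{170\ \mathrm{s}} \approx 1.7\times, \qquad
\frac{290\ \mathrm{s}}{35\ \mathrm{s}} \approx 8.3\times, \qquad
\frac{290\ \mathrm{s}}{7\ \mathrm{s}} \approx 41\times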

Implementing all of the software modules in hardware through co-processing techniques, the total end result rolls up to yield an amazing 41x improvement – with the added benefit of reducing the application code size. Because we removed multiple functions requiring a large number of instructions and replaced them with a single instruction to read or write the FSL port, we have far fewer instructions and thus achieved some code compaction. In the MP3 application section, for example, we saw a 20% reduction in the code footprint.

Best of all, design changes through intelligent tools like Platform Studio are easy, quick, and can still be implemented well into the product development cycle. Software-only methods of performance improvement are time-consuming and usually have a limited return on investment. By balancing the partition of software application, hardware implementation, and co-processing in a programmable platform, you can attain much more optimal results.

Conclusion
Based on the examples described in this article, we were able to easily customize a full embedded processing system, edit the IP for the optimal balance of feature/size/cost, and additionally squeeze out huge performance gains where none appeared possible. The Virtex-4 and Spartan-3 device families offer flexible soft-processor solution options that can be designed and refined late into the development cycle. The award-winning MicroBlaze soft-processor core combined with the award-winning Platform Studio tool suite provides a powerful combination to kick-start your embedded designs.

Co-processing techniques, such as implementing computationally intense software algorithms as high-performance FPGA hardware instructions, allow you to accelerate the performance of modules by 2x, 10x, or as much as 40x+ in our common industry example. Imagine what it can do for your next design – think about the headroom and flexibility available for your design late in the development cycle, or being able to proactively plan improvements to the next generation of your product.

For more information on Xilinx embedded processing solutions, visit www.xilinx.com/processor.

Figure 5 – Co-processing acceleration results (100 MHz MicroBlaze processor):
• Pure software = 290 seconds
• + FSL + IMDCT = 190 seconds
• + FSL + DCT = 170 seconds
• + FSL + LL MAC = 35 seconds
• + FSL + DCT + IMDCT + LL MAC = 7 seconds (41x overall)


Tracing Your Way to Better Embedded Software
Techniques to isolate and debug critical software issues.

by Jay Gould
Product Manager
Xilinx, Inc.
[email protected]

Steve Veneman
Product Manager
Wind River
[email protected]

Have you ever worked on an embedded processing project where your team invested an exorbitant amount of test time only to find out at the end of the project that your product had a series of intermittent and hard-to-find bugs? You used your software debugger throughout the development process and spent man-weeks stepping through the code in your lab, yet when your software passed integration testing you were stunned to find that your QA team (or worse yet, a customer) discovered serious system flaws.

Some embedded bugs are harder to find than others and require advanced detection methods. Testing small units of code in your office or lab does not fully exercise your embedded product the same way as when it is fully assembled and deployed in real-time conditions. Add other modules from other engineers and your simple unit tests may still pass for hours at a time. Often bugs will not reveal themselves until you run the code for much longer periods of time or with more complex interactions with other code modules.

So how do you know if you are really testing all your code? Is it possible that a colleague’s software may be overwriting a variable or memory address utilized in your module?

The more complex your embedded products become, the more your development and debugging tools must scale in sophistication. Often “test time” is equated with “good testing” practices: the longer you run tests, the better you must be exercising the code. But this is often misleading for a number of reasons. Stubbing code and testing a module in a stand-alone method will miss many of the interaction mistakes of the rest of the final system. Running dozens or hundreds of “use cases” may seem powerful, but may create a false sense of security if you don’t actually track metrics like code “coverage,” where you pay more attention to exercising all of the code instead of repeatedly testing a smaller subset.

Standard software debuggers are a useful and much-needed tool in the embedded development process, especially in the early stages of starting, stepping, and stopping your way through brand-new code units. Once you have significant code applications developed, testing the system as a whole becomes much more representative of an end-user product, with all of the interactions of a real-time embedded system. However, even a good software debugger may not improve your chances of finding an intermittent, seldom-occurring bug. Serious code interaction – made more complex with asynchronous interrupts of real-world embedded devices – mandates the use of proper tools to observe the entire system, creating a more robust validation environment in which to find the sneakier bugs.


Trace Your Code Execution
Once your code goes awry, locks up your system, core dumps, or does something seriously unexpected, it is time to improve the visibility of exactly what your software was doing before the flaw. Often the symptom of the failure is not clearly tied to the cause at all. Some debuggers or target tools running on the embedded device will die with a fatal system flaw, so they cannot be used to track that flaw. In these cases, you can use a separate and external tool with its own connections and hardware memory to provide an excellent view of system execution all the way up to a complete failure. “Trace” tools, often associated with JTAG probes or ICEs (in-circuit emulators), can display an abundance of useful information for looking at software instructions executed, lines of code executed, and even performance profiling.

Capturing software execution with a trace probe allows more debugging capability than just examining a section of code. An intelligent trace tool provides a wide variety of trace and event filters, as well as trigger and capture conditions. One of the most useful situations for trace techniques is to identify defects that cause target crashes on a random, unpredictable basis.

Connecting a probe (Figure 1) like Wind River ICE with the integrated Wind River Trace tool introduces a method by which you can configure trace to run until the target crashes. By triggering on “no trigger,” the trace will continue to capture in a circular buffer of memory in the probe until the system crashes. You may have unsuccessfully tried to manually find a randomly occurring defect with your software debugger, but with a smart trace tool, you can now run the full battery of system tests and walk away to let the trace run. When that intermittent event finally occurs again, crashing the target, the tool will capture and upload the data for your examination.
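As a rough software analogy (not how the probe hardware is actually built), the "capture until the crash" idea is simply a circular buffer that is written continuously and frozen when the stop condition finally occurs, so the most recent history is always available.

/* Rough software analogy of a circular capture buffer frozen at a crash;
 * illustrative only, not the Wind River ICE implementation. */
#include <stdint.h>

#define TRACE_DEPTH 4096                 /* illustrative buffer depth */

static uint32_t trace_buf[TRACE_DEPTH];
static unsigned trace_head;              /* next slot to overwrite */

void trace_record(uint32_t event)        /* called on every traced event */
{
    trace_buf[trace_head] = event;
    trace_head = (trace_head + 1) % TRACE_DEPTH;  /* wrap: oldest entry is lost */
}

/* On the stop condition (the fatal flaw), recording stops and the buffer is
 * read back starting at trace_head: entries then run oldest to newest. */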

The execution trace will show a deep history of software instructions executed on your system leading up to the fatal flaw, providing much more insight into what the system was doing and allowing you to find and fix the bug faster. Often these problems are not easily found by post-mortem debugging through register and memory accesses. Embedded systems commonly crash when a function is overwriting a variable that should not be accessed/written, or when memory is corrupted by executable code being overwritten by another routine. It is difficult to diagnose these types of problems when no information exists on what actions led up to an event, and the crash of the system may not actually take place until much later when the corrupted information is next accessed.

With advanced tools like Wind River Trace, you can capture the actual instruction and data information that was executed on the target before the event occurred. Using this information, you can back up through the executed code and identify what events happened before the crash, identifying the root cause of the problem (see Figure 2). In addition to a chronologically accurate instruction sequence, when used on a PowerPC™ embedded platform like Xilinx® Virtex™-II Pro or Virtex-4 devices, the Wind River Trace execution information also includes specific time-stamp metrics. Coincidentally, the Wind River ICE actually implements Xilinx Virtex-II Pro FPGA devices in the probe hardware itself.

Coverage Analysis and Performance Profiling
Because the trace tool and probe are able to provide a history of the software execution, they can tell you exactly where you have been – and what code you have missed. You miss 100% of the bugs that you don’t look for, so collecting coverage analysis metrics is important to your testing strategy.


Figure 2 – Wind River Trace software execution view

Figure 1 – The Wind River ICE JTAG probe implements Virtex FPGAs.


As code execution is almost invisible without an accurate trace tool, it is common for entire blocks or modules of code to go unexecuted during test routines or redundant use-case suites. Testing the same function 10 times over may make sense in many cases, but missing a different function altogether means that you will not find those bugs in your lab – your customer will find them in the field.

Coverage metrics showing which functions are not executed are useful for writing new, additional tests or identifying unused “dead” code, perhaps leftover from a previous legacy project. Because many designs are re-ports or updates of previous products, many old software routines become unnecessary in a new design, but the legacy code may be overlooked and not removed. In applications where code size is critical, removing dead code reduces both waste as well as risk; nobody wants to wander into an obsolete software routine that may not even interact properly with a newer version of hardware.

In safety-critical applications like avionics, medical, automotive, or defense/aerospace, higher than normal software testing standards may be mandated. Certainly the software running the ABS brakes in your car, a medical device, or an aircraft control system should be tested to a higher degree than a software game or other non-life-threatening application. In these life-critical applications code must be executed and tested (covered) down into every function of software instructions, and that execution may need to be documented to a formal, higher level governing body.

In many cases code coverage can also be used to analyze errant behavior and the unexpected execution of specific branches or paths. Erratic performance, failure to complete a task, or not executing a routine in a repeatable fashion may be just as inappropriate in some applications as a fatal crash or other defect. By tracing function entries and exits, comprehensive data is collected on software operations and you can throw away the highlighter and program listing (Figure 3).

Because performance is always a concern in modern embedded applications, having an accurate way to measure execution times is paramount. The capabilities built into probes and trace tools allow you to look at the data in a performance view in addition to the go/no-go code coverage view. By identifying bottlenecks in software execution, you can identify routines that fail to meet critical timing specifications or optimize routines for better performance. With the time-stamp information in Virtex-II Pro and Virtex-4 devices, you can also use trace tools to determine how long a software function or a particular section of code took to run.

Whether you are trying to find routines that could use some optimization recoding (perhaps in-lining code blocks rather than accruing overhead by repeatedly calling a separate function) or identifying a timing flaw in a time-critical routine, performance-metric tools are required to collect multiple measurements. Settling for a couple of passes of a routine with a manual debugger is not sufficient for validating timing specs on important code blocks. Additionally, if a real-time system is running an embedded operating system, verifying the capability of the interrupt service routines (ISRs) is essential to determine how quickly the system responds to interrupts and how much time it spends handling them.

Conclusion
If critical performance measurements or code coverage are required before your embedded product can ship, then you need to use specialized tools to capture and log appropriate metrics. Accurate system trace metrics are a must if you have ever been challenged by a seemingly random, hard-to-reproduce bug that your customers seem to experience regularly but that you cannot duplicate in your office.

Utilizing an accurate hardware probe like Wind River Trace with a powerful trace/coverage/performance tool suite will provide you with the data you require. Stop guessing about what is really going on and gain new visibility into your embedded applications.

For more information about Wind River Trace and other products, visit www.windriver.com/products/development_suite/wind_river_trace/.


Figure 3 – Wind River profiling and coverage views


X-ray vision for your designs: Agilent logic analyzers with FPGA dynamic probe. New! MicroBlaze support.

Now you can see inside your FPGA designs in a way that will save days of development time. The FPGA dynamic probe, when combined with an Agilent Windows-based logic analyzer, allows you to access different groups of signals inside your FPGA for debug – without requiring design changes. You’ll increase visibility into internal FPGA activity by gaining access to up to 64 internal signals with each debug pin.

Our newest 16800 Series logic analyzers offer you unprecedented price-performance in a portable family, with up to 204 channels, 32M memory depth and a pattern generator available.

Agilent’s user interface makes our logic analyzers easy to get up and running. A touch screen or mouse makes it simple to use, with prices to fit your budget. Optional soft touch connectorless probing solutions provide excellent reliability, convenience and the smallest probing footprint available. Contact Agilent Direct today to learn more.

• Increased visibility with FPGA dynamic probe
• Intuitive Windows® XP Pro user interface
• Accurate and reliable probing with soft touch connectorless probes
• 16900 Series logic analysis system prices starting at $21,000
• 16800 Series portable logic analyzers starting at $9,430

Get a quick quote and/or FREE CD-ROM with video demos showing how you can reduce your development time.
U.S. 1-800-829-4444, Ad# 7909
Canada 1-877-894-4414, Ad# 7910
www.agilent.com/find/logic
www.agilent.com/find/new16903quickquote

©Agilent Technologies, Inc. 2004. Windows is a U.S. registered trademark of Microsoft Corporation.


To Interleave, or Not to Interleave: That is the Question
Controlling crosstalk in PCI Express and other RocketIO implementations.

by Patrick Carrier
Technical Marketing Engineer
Mentor Graphics
[email protected]

With the emergence of multi-gigabit-per-second SERDES bus architectures such as PCI Express, Xilinx® RocketIO™ technology has become a vital part of FPGA designs. But accompanying these faster buses is a resurgence of an old design problem: crosstalk. The source of concern is similar to what occurs in buses like PCI and PCI-X: a large number of signals all need to go to the same place, from the FPGA to a connector or to another chip.

The bad news is that PCI Express edge rates are much faster than PCI and PCI-X. The fastest edges can be on the order of 50 ps, which equates to around a third of an inch on a printed circuit board. This equates to more crosstalk for a given amount of parallelism, which can be quite large for typical PCI Express link lengths.

The good news is that because PCI Express links comprise unidirectional differential pairs, you only need to control crosstalk at the receiver. This characteristic of PCI Express has actually led to two theories on how to implement board routing: interleaving TX and RX pairs or keeping like pairs together. There is no one correct answer to this design question; the answer depends on the characteristics of the design.

FEXT, NEXT, and Interleaving
In order to determine the best method for routing adjacent PCI Express differential pairs, you should understand the nature of both forward and reverse crosstalk. Both types of crosstalk are caused by coupling between traces; that coupling includes mutual inductance as well as mutual capacitance.

In forward crosstalk, also referred to as far-end crosstalk or FEXT, coupled energy propagates onto the “victim” signal in the same direction as the “aggressor” signal. As such, the FEXT pulse has an edge equivalent in length to the aggressor signal and continues growing in amplitude as the signals propagate down the line. The amplitude of FEXT is thus determined by the length of parallelism between adjacent traces and the amount of coupling, as well as the balance between the inductive and capacitive components of coupling.



In reverse crosstalk, also known as near-end crosstalk or NEXT, the coupled energy propagates in the opposite direction of the aggressor signal. As such, its amplitude is also length-dependent, but saturates for any coupled length greater than the signal edge rate. In NEXT, the inductive and capacitive coupling are additive.
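For reference, the familiar first-order textbook approximations for weakly coupled lines (not taken from this article) make those two behaviors explicit. Here L_m and C_m are the mutual and L and C the self inductance and capacitance per unit length, V_agg is the aggressor swing, t_r is the edge rate, and TD is the delay of the coupled length; NEXT saturates once the coupled length exceeds the edge, while FEXT grows with length and vanishes when C_m/C = L_m/L, the homogeneous-stripline case discussed later.

\left|V_{\mathrm{NEXT}}\right| \approx \frac{V_{\mathrm{agg}}}{4}
  \left(\frac{C_m}{C} + \frac{L_m}{L}\right),
\qquad
\left|V_{\mathrm{FEXT}}\right| \approx \frac{V_{\mathrm{agg}}}{2}\,
  \frac{TD}{t_r}\left|\frac{C_m}{C} - \frac{L_m}{L}\right|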

Figure 1 shows examples of these two types of crosstalk in the Mentor Graphics HyperLynx oscilloscope. Notice how the forward crosstalk (shown in light blue) is a large spike whose leading edge has an edge rate equivalent to that of the aggressor signal (in this case 100 ps). The reverse crosstalk (shown in yellow) is much smaller in amplitude, with a width equal to twice the total line length (in this case, 5 inches or 712 ps). The topology used to generate these waveforms in HyperLynx LineSim is depicted in Figure 2.

You can reduce the coupling between traces most effectively by increasing the spacing between them. Because PCI Express comprises unidirectional differential pairs, you can further control crosstalk by altering the aggressors’ direction to allow for only reverse crosstalk or only forward crosstalk.

For trace configurations where the forward crosstalk exceeds the reverse crosstalk, the preferred method of routing differential pairs would be to interleave them. If a TX signal is placed adjacent to an RX signal, the forward crosstalk created by that signal will go towards the transmitter of the RX signal, where it is not of concern. The reverse crosstalk created by the TX signal will go towards the receiver of the RX signal.

Conversely, if the reverse crosstalk exceeds the forward crosstalk, non-interleaved routing would be preferable; all forward crosstalk would be directed towards the victims’ receivers instead of the reverse crosstalk.

Microstrip and Stripline Crosstalk Analysis
To determine when each method of routing is appropriate, let’s examine both inner- and outer-layer routing in simulation with a topology similar to that shown in Figure 2. In the first example, a set of three microstrip differential pairs, with 5-mil wide copper traces 2 mils thick separated by 5.7 mils, sitting on a 5-mil dielectric with an Er = 4, covered with a 0.5-mil solder mask with an Er = 3.3, have their spacing and lengths varied in simulation to obtain the information shown in Table 1.

In microstrip routing, far-end crosstalk dominates, except in the very rare case of super-short routing and super-tight spacing. As such, it would be reasonable to make a general rule that PCI Express pairs should be routed as interleaved when being routed on microstrip.

In the second example, three stripline (inner-layer) differential pairs, with 4-mil wide copper traces 2 mils thick separated by 5.0 mils, in a dual-stripline configuration with dielectric heights of 4, 20, and 4 mils, all modeled with an Er = 3.5, have their spacing and lengths varied in simulation to obtain the information shown in Table 2.


Figure 1 – Forward and reverse crosstalk

Figure 2 – Topology used for crosstalk analysis


Notice that the far-end crosstalk is 0 mV. This is because of the balance that exists between the counteracting capacitive and inductive couplings on the homogenous stripline. This might lead you to believe that non-interleaved routing should always be used on the stripline. However, that would be an erroneous assumption. Homogenous stripline rarely exists on a real printed circuit board. There are differences in dielectric constants between cores and prepregs, the two types of dielectrics that make up a stripline. Furthermore, there are resin-rich areas localized to the traces that alter the dielectric constant around the traces. To properly determine whether or not to interleave stripline traces, you must appropriately model the dielectric constants.

The third example uses the same dual stripline trace configuration, except that the two 4-mil dielectrics are assigned an Er = 3.5, the 20-mil dielectric an Er = 4.5, and the thin layer of dielectric on the signal layer is given an Er = 3.2. Differential pair spacings and lengths are varied to produce the results shown in Table 3.

From this data, it is evident that on longer buses, or buses with longer amounts of parallelism, you should use interleaved routing because far-end crosstalk dominates. Near-end crosstalk dominates for shorter routes with tighter spacings. The point at which the far-end crosstalk exceeds the near-end crosstalk varies based on length, spacing, and trace configuration. That crossover point is what you should use to determine whether or not to interleave TX and RX differential pairs.

Conclusion
Clearly, the majority of board routing implementations for PCI Express and similar RocketIO implementations requires that you interleave TX and RX differential pairs in the layout. This is certainly true for microstrip, where forward crosstalk dominates. It is also true for stripline, with the exception of tighter spacing with shorter lengths. That is because forward crosstalk is non-zero in an actual stripline, and will in many cases exceed reverse crosstalk. Simulation, with proper modeling of the stripline dielectrics, reveals this phenomenon, while pinpointing the trace configuration where the FEXT will exceed the NEXT and indicate whether or not to interleave.

To learn more about designing with RocketIO technology, visit www.xilinx.com/serialdesign and www.mentor.com/highspeed.


Table 1 – Crosstalk in terms of differential pair spacing versus trace length for a microstrip configuration

Length  | 1H                           | 2H                             | 3H                            | 5H
1 in.   | NEXT = 75 mV, FEXT = 25 mV   | NEXT = 27.9 mV, FEXT = 18 mV   | NEXT = 12.7 mV, FEXT = 12 mV  | NEXT = 3.7 mV, FEXT = 5.4 mV
2 in.   | NEXT = 75 mV, FEXT = 47 mV   | NEXT = 27.9 mV, FEXT = 37.8 mV | NEXT = 12.7 mV, FEXT = 24 mV  | NEXT = 3.7 mV, FEXT = 10.5 mV
5 in.   | NEXT = 75 mV, FEXT = 125 mV  | NEXT = 27.9 mV, FEXT = 94 mV   | NEXT = 12.7 mV, FEXT = 62 mV  | NEXT = 3.7 mV, FEXT = 26 mV
10 in.  | NEXT = 75 mV, FEXT = 255 mV  | NEXT = 27.9 mV, FEXT = 192 mV  | NEXT = 12.7 mV, FEXT = 126 mV | NEXT = 3.7 mV, FEXT = 53 mV
20 in.  | NEXT = 75 mV, FEXT = 508 mV  | NEXT = 27.9 mV, FEXT = 381 mV  | NEXT = 12.7 mV, FEXT = 254 mV | NEXT = 3.7 mV, FEXT = 105 mV
30 in.  | NEXT = 75 mV, FEXT = 625 mV  | NEXT = 27.9 mV, FEXT = 555 mV  | NEXT = 12.7 mV, FEXT = 379 mV | NEXT = 3.7 mV, FEXT = 158 mV

Table 2 – Crosstalk in terms of differential pair spacing versus trace length for a homogenous stripline configuration (values are identical at every simulated length from 1 in. to 30 in.)

Length    | 1H                           | 2H                           | 3H                           | 5H
1–30 in.  | NEXT = 72.1 mV, FEXT = 0 mV  | NEXT = 29.9 mV, FEXT = 0 mV  | NEXT = 14.5 mV, FEXT = 0 mV  | NEXT = 4.5 mV, FEXT = 0 mV

Table 3 – Crosstalk in terms of differential pair spacing versus trace length for a realistic stripline configuration

Length  | 1H                              | 2H                            | 3H                              | 5H
1 in.   | NEXT = 78.4 mV, FEXT = 5.2 mV   | NEXT = 34 mV, FEXT = 5.2 mV   | NEXT = 17.2 mV, FEXT = 3.5 mV   | NEXT = 5.6 mV, FEXT = 3.6 mV
2 in.   | NEXT = 78.4 mV, FEXT = 9.5 mV   | NEXT = 34 mV, FEXT = 10.1 mV  | NEXT = 17.2 mV, FEXT = 7 mV     | NEXT = 5.6 mV, FEXT = 3.6 mV
5 in.   | NEXT = 78.4 mV, FEXT = 18.2 mV  | NEXT = 34 mV, FEXT = 21.5 mV  | NEXT = 17.2 mV, FEXT = 17.4 mV  | NEXT = 5.6 mV, FEXT = 10.8 mV
10 in.  | NEXT = 78.4 mV, FEXT = 34.8 mV  | NEXT = 34 mV, FEXT = 42.8 mV  | NEXT = 17.2 mV, FEXT = 34.6 mV  | NEXT = 5.6 mV, FEXT = 19.7 mV
20 in.  | NEXT = 78.4 mV, FEXT = 67.1 mV  | NEXT = 34 mV, FEXT = 82.1 mV  | NEXT = 17.2 mV, FEXT = 70.5 mV  | NEXT = 5.6 mV, FEXT = 39.2 mV
30 in.  | NEXT = 78.4 mV, FEXT = 101.4 mV | NEXT = 34 mV, FEXT = 124.8 mV | NEXT = 17.2 mV, FEXT = 104 mV   | NEXT = 5.6 mV, FEXT = 58.6 mV


SystemBIST™ enables FPGA configuration that is less filling for your PCB area, less filling for your BOM budget and less filling for your prototype schedule. All the things you want less of, and more of the things you do want for your PCB – like embedded JTAG tests and CPLD reconfiguration.

Typical FPGA configuration devices blindly “throw bits” at your FPGAs at power-up. SystemBIST is different – so different it has three US patents granted and more pending. SystemBIST’s associated software tools enable you to develop a complex power-up FPGA strategy and validate it. Using an interactive GUI, you determine what SystemBIST does in the event of a failure, what to program into the FPGA when that daughterboard is missing, or which FPGA bitstreams should be locked from further updates. You can easily add PCB 1149.1/JTAG tests to lower your downstream production costs and enable in-the-field self-test. Some capabilities:

• User defined FPGA configuration/CPLD re-configuration
• Run Anytime-Anywhere embedded JTAG tests
• Add new FPGA designs to your products in the field
• “Failsafe” configuration – in-the-field FPGA updates without risk
• Small memory footprint offers lowest cost per bit FPGA configuration
• Smaller PCB real estate, lower parts cost compared to other methods
• Industry-proven software tools enable you to get it right before you embed
• FLASH memory locking and fast re-programming
• New: At-speed DDR and RocketIO™ MGT tests for V4/V2

If your design team is using PROMs, CPLD & FLASH or CPU and in-house software to configure FPGAs, please visit our website at http://www.intellitech.com/xcell.asp to learn more.

Copyright © 2006 Intellitech Corp. All rights reserved. SystemBIST™ is a trademark of Intellitech Corporation. RocketIO™ is a registered trademark of Xilinx Corporation.


©2006 Mentor Graphics Corporation. All Rights Reserved. Mentor Graphics, Accelerated Technology, and Nucleus are registered trademarks of Mentor Graphics Corporation. All other trademarks and registered trademarks are property of their respective owners.

Accelerated Technology, A Mentor Graphics Division [email protected] • www.AcceleratedTechnology.com

With a rich debugging environment and a small, auto-configurable RTOS, Accelerated Technology provides the ultimate embedded solution for MicroBlaze development. Our scalable Nucleus RTOS and the Eclipse-based EDGE development tools are optimized for MicroBlaze and PowerPC processors, so multi-core development is easier and faster than ever before. Combined with world-renowned customer support, our comprehensive product line is exactly what you need to develop your FPGA-based device and get it to market before the competition!

For more information on the Nucleus complete solution, go to www.acceleratedtechnology.com/xilinx

The Ultimate Embedded Solution for Xilinx MicroBlaze™


Ravi Kiran Karanam
Research Assistant
Department of Electrical and Computer Engineering
University of North Carolina at Charlotte
[email protected]

Arun Ravindran
Assistant Professor
Department of Electrical and Computer Engineering
University of North Carolina at Charlotte
[email protected]

Arindam Mukherjee
Assistant Professor
Department of Electrical and Computer Engineering
University of North Carolina at Charlotte
[email protected]

Cynthia Gibas
Associate Professor
Department of Computer Science
University of North Carolina at Charlotte
[email protected]

Anthony Barry Wilkinson
Professor
Department of Computer Science
University of North Carolina at Charlotte
[email protected]

Using FPGA-Based Hybrid Computers for Bioinformatics Applications
Seamless integration of FPGAs into grid computing infrastructures is key to the adoption of FPGAs by bioinformaticians.

The past decade has witnessed an explosive growth of data from fields in the biology domain, including genome sequencing and expression projects, proteomics, protein structure determination, and cellular regulatory mechanisms, as well as biomedical specialties that focus on the digitization and integration of patient information, test, and image records. Bioinformatics refers to the storage, analysis, and simulation of biological information and the prediction of experimental outcomes.

To address the computing and data management needs in bioinformatics, the traditional approach has been to use clusters of low-cost workstations capable of delivering gigaflops of computing power. However, microprocessors have general-purpose computing architectures, and are not necessarily well suited to deliver the teraflops of high-performance capability required for compute- or data-intensive applications. The recent availability of off-the-shelf high-performance FPGAs – such as Xilinx® Virtex™-II Pro devices with on-board high-capacity memory banks – has changed the computing paradigm by enabling high-speed processing and high-bandwidth memory access.

Nallatech offers multi-FPGA computing cards containing between one and seven Virtex FPGAs per card. Several such cards are plugged into the PCI slots of a desktop workstation and networked using Nallatech’s DIMEtalk networking software, greatly increasing the available computing capability. This is our concept of a hybrid computing platform. The hybrid platform comprises several workstations in a cluster that you can integrate with Nallatech multi-FPGA cards to construct a high-performance computing system.

Grid Computing
Grid computing is a form of distributed computing that employs geographically distributed and interconnected computing sites for high-performance computing and resource sharing. It promotes the establishment of so-called virtual organizations – teams of people from different organizations working together on a common goal, sharing computing resources and possibly experiment equipment.

In recent years, grid computing has become increasingly popular for tackling difficult bioinformatics problems. The rise of “bio-grids” (such as the NIH Cancer Biomedical Informatics Grid and Swiss Biogrid) is driven by the increasingly enormous datasets and computational complexity of the algorithms involved. Computational grids allow researchers to develop a bioinformatics workflow locally and then use the grid to identify and execute tasks seamlessly across diverse computing platforms.

To integrate the computing power of FPGA-based hybrid computing platforms with the existing computing infrastructure used by the bioinformatics community, we have grid-enabled the hybrid platform using the open-source Globus Toolkit. Thus, the hybrid platform and the associated software are available as a grid resource so that bioinformatics applications can be run over the grid. The hybrid platform partitions the tasks to be run on processors and FPGAs and uses application program interfaces (APIs) to transfer data to and from the FPGAs for accelerated computation. Bioinformaticians can now take advantage of the computing power of our FPGA hybrid computing platform in a transparent fashion.

Hybrid Computing Platform
The different FPGA-related components in the hybrid platform are:

• High-capacity motherboard – BenNUEY: The Nallatech BenNUEY motherboard features a Xilinx Virtex-II Pro FPGA and module sites for as many as six additional FPGAs. The PCI/control and low-level drivers abstract the PCI interfacing, resulting in a simplified design process for designs/applications.

• Virtex-II expansion module – BenBlue-II: The Nallatech BenBlue-II DIME-II module provides a substantial logic resource ideal for implementing applications that have a large number of processing elements. Through support for as many as two on-board Xilinx Virtex-II Pro FPGAs (XC2VP100 devices), the BenBlue-II can provide more than 200,000 logic cells on a single module.

• Multi-FPGA management – DIMEtalk: To manage the large silicon resource pool provided by the hardware, the Nallatech DIMEtalk tool accelerates the design flow for creating a reconfigurable data network by providing a communications channel between FPGAs and the host user environment.

• FUSE Tcl/Tk control and C++ APIs: Nallatech’s FUSE is a reconfigurable operating system that allows flexible and scalable control of the FPGA network directly from applications using the C++ development API, which is complemented by a Tcl/Tk toolset for scripting-based control.

DNA Microarray Design – A Case Study
Our goal was to accelerate the Smith-Waterman implementation in the EMBOSS suite of publicly available bioinformatics code. The Smith-Waterman algorithm is widely used to screen gene databases for sequence similarity, with many different applications in bioinformatics research. Smith-Waterman is specifically used in situations where faster heuristic methods fail to detect some potentially meaningful sequence hits.

Dr. Cynthia Gibas of the Bioinformatics Center at UNC Charlotte currently uses water.c, the Smith-Waterman implementation in the open-source EMBOSS software, as a part of a DNA microarray design workflow. The biology goal is to select the optimal probe sequences to be printed on a DNA microarray, which will then be used in the lab to detect individual gene transcripts in a target mixture with high specificity.

Hardware/Software Partitioning
The EMBOSS implementation of the Smith-Waterman (water.c) is an extensive C program comprising more than 400 functions. A partitioning strategy is required to identify the functions that need to be implemented on the FPGA and those that remain in software (and run on the processor). The partition is done by profiling the execution characteristics of the code.


First, the legacy C code in the water.c file is profiled using the Linux gprof tool. Profiling tells us where a given code spends time during execution, as well as different functional dependencies.

Table 1 shows the results of profiling in terms of the execution times of the top five functions executed by the code, listed in decreasing order of execution time. Note that one function, embAlignPathCalcSW, accounts for 83% of the total amount of program execution time. The embAlignPathCalcSW function uses the Smith-Waterman-based local alignment algorithm to create a “path matrix” containing local alignment scores of comparing probe and database sequences at different matrix locations, and a compass variable to show which partial result is used to compute the score at a certain location.
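For readers unfamiliar with the algorithm, the path-matrix recurrence that such a function evaluates has the general shape sketched below. This is a simplified software illustration with made-up scoring parameters and an illustrative compass encoding, not the EMBOSS source or the VHDL used on the FPGA.

/* Simplified Smith-Waterman path-matrix sketch with a compass value per
 * cell (0 = stop at zero, 1 = diagonal, 2 = up, 3 = left). Illustration of
 * the recurrence only; not the EMBOSS water.c code or the hardware design. */
#define GAP       8      /* illustrative linear gap penalty */
#define MATCH     5      /* illustrative match score        */
#define MISMATCH (-4)    /* illustrative mismatch score     */

/* H and compass are (plen+1) x (tlen+1) row-major matrices; the caller has
 * already zeroed row 0 and column 0. */
void sw_path_matrix(const char *p, int plen, const char *t, int tlen,
                    int *H, unsigned char *compass)
{
    for (int i = 1; i <= plen; i++) {
        for (int j = 1; j <= tlen; j++) {
            int sub  = (p[i - 1] == t[j - 1]) ? MATCH : MISMATCH;
            int diag = H[(i - 1) * (tlen + 1) + (j - 1)] + sub;
            int up   = H[(i - 1) * (tlen + 1) + j] - GAP;
            int left = H[i * (tlen + 1) + (j - 1)] - GAP;

            int best = 0;                       /* local alignment floor  */
            unsigned char dir = 0;
            if (diag > best) { best = diag; dir = 1; }
            if (up   > best) { best = up;   dir = 2; }
            if (left > best) { best = left; dir = 3; }

            H[i * (tlen + 1) + j] = best;       /* score for this cell    */
            compass[i * (tlen + 1) + j] = dir;  /* backtracking hint      */
        }
    }
}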

Once the code profiling is done, the computationally intense embAlignPathCalcSW call is mapped to the FPGA network using VHDL, while the rest of the code is run on the processor. Calls to the computationally intense embAlignPathCalcSW function in the C code of the water.c file are then replaced with corresponding application program interface (API) calls to the FPGA network. These APIs transfer data between the FPGA network and the processor, such that the calculation of the scores in the path matrix is done inside the FPGA network. All other parts of the code, including backtracking, are executed on the processor in software. A flow diagram of these steps is shown in Figure 1.

Hardware Implementation
When comparing a target sequence (T) from a database with a probe sequence (P), the scores and the compass values of the path matrix in the embAlignPathCalcSW function are calculated using a systolic array of basic processing elements (PEs).

Figure 2 shows the systolic array implementation of the path matrix algorithm for Smith-Waterman with three PEs. Each PE passes the score and compass values it calculates to the successive PE, which in turn uses these values to calculate its path and compass values. At each clock cycle the path and the compass values are stored in the block RAM. At the end of the computation, each block RAM has the corresponding row of the path matrix stored in it.

The output of the computationally intensive function involves huge amounts of data transfer (on the order of probe length times the target length). As a result, the improvements achieved in computing performance are compromised by communication latency. To decrease communication latency, two additional functions (embAlignScoreCalcSW and embAlignWalkSWMatrix) were moved to the FPGA from software. These functions do the backtracking and calculate the score and alignment sequence for the given probe and target. The functions operate on the path matrix and calculate the maximum path value.


Table 1 – Profiling results of the EMBOSS Smith-Waterman implementation (each sample counts as 0.01 seconds)

% Time | Cumulative (sec) | Self (sec) | Calls      | Self ms/call | Total ms/call | Function
83.32  | 178.94           | 178.94     | 32768      | 5.46         | 5.65          | embAlignPathCalcSW
5.12   | 189.94           | 11.00      | 32768      | 0.34         | 0.35          | embAlignWalkSWMatrix
4.62   | 199.87           | 9.93       | 32768      | 0.30         | 0.30          | embAlignScoreSWMatrix
2.92   | 206.13           | 6.26       | 2719612928 | 0.00         | 0.00          | ajSeqCvtK
1.86   | 210.13           | 4.00       | 53936723   | 0.00         | 0.00          | match

Figure 1 – Design flow for accelerating water.c on the FPGA-processor hybrid computer: profile the Smith-Waterman algorithm (C/C++), design the most computational function on the FPGA network (VHDL), then utilize the API to merge it back into the Smith-Waterman algorithm (C/C++)

Figure 2 – Systolic array of three PEs for the path matrix calculation


Then, starting at the location of the maximum path value, the functions backtrack through the path values in the path matrix based on the compass values to determine the best alignment. The output of the FPGA is now the score and the alignment sequence, which is only about the size of the probe sequence, thus greatly reducing communication latency.

Grid-Enabling the Hybrid Computing Platform
The first step to grid-enable our resource was to install the components of the grid software utility package called Globus Toolkit 4.0 (GT4). GT4 is an open-source reference implementation of key grid protocols. The toolkit includes software for security, information infrastructure, resource management, data management, communication, fault detection, and portability.

The core of GT4, the open grid services architecture (OGSA), is an integration of grid technologies and Web service technologies. Above the OGSA, job submission and control is performed by grid resource allocation and management (GRAM). GRAM provides a single interface for requesting and using remote system resources for the execution of “jobs.” The Globus security interface enables the authentication and authorization of users and groups in the grid through the exchange of signed certificates. Certificates are created and maintained by a tool called SimpleCA in GT4.

Installation and Integration with the UNC-Charlotte VisualGrid
To test our installation in a real grid, we took advantage of a new grid project called VisualGrid, a collaborative project between UNC-Charlotte, UNC-Asheville, and the Environmental Protection Agency. The FPGA-based hybrid computing platform is added as a client to VisualGrid through the master CA server.

Results
We implemented a prototype acceleration task comparing a 40-nucleotide probe sequence with target database sizes of as many as 1,048,576 targets of an approximate length of 850 nucleotides, both in software (processor) and on the FPGA hybrid computing platform. The probe sequence and the target database reside in the main memory of the host computer (dual Opteron-based Sun Java W1100z workstation).

For each call to the embAlignPathCalcSW, embAlignScoreCalcSW, and embAlignWalkSWMatrix functions from water.c, a 32-bit probe and target combination is sent over the PCI interface to the PCI FIFO on the Nallatech BenNUEY motherboard at a rate of 33 MHz. From the PCI FIFO, the data is transferred to a 32-bit, 512-word-long input FIFO on the Virtex-II Pro FPGA of the BenNUEY motherboard. The systolic array reads the data from this FIFO. The 40-processing-element-deep systolic array operates at 40 MHz.

After the systolic array completes processing on a single string, the output values from the block RAM are used to calculate the alignment. The alignment results are then written back to a different block RAM. Finally, the host processor reads the output alignment string from the block RAM over the PCI interface.

The hybrid computing platform was accessed through the VisualGrid with jobs submitted through the UNC-Charlotte VisualGrid Portal. The EMBOSS water.c program ran in software on the workstation and separately on the FPGA-based hybrid computing platform. Figure 3 shows the comparison between the run times of the two implementations. For small databases, the processing times of the processor and the hybrid computing platforms are comparable. However, as databases get larger, the processor computing time rises exponentially, while the hybrid computing platform shows a linear increase. For a database size of 1,048,576 strings, the hybrid computing platform is 44 times faster than execution on a processor. Such large databases are common in bioinformatics applications.

Conclusion
We have demonstrated the computing potential of a grid-enabled hybrid computing platform for sequence-matching applications in bioinformatics. A careful partitioning of the computing task between the processor and the FPGAs on the hybrid platform circumvented the potential I/O bottleneck associated with large bioinformatics databases. The ability to submit jobs over the grid enables ready integration and use of the hybrid computing platforms in bioinformatics grids. We are currently involved in integrating the hybrid computing platform on the large inter-university SURAGrid (www.sura.org/SURAgrid).

This research was partly funded by NSF Award Number 0453916, and by Nallatech Ltd.


Figure 3 – Execution time for water.c for different sizes of sequence database strings running on a processor and a processor + FPGA hybrid system


II DRR Delivers High-Performance SDR
Innovative Integration utilizes Virtex-II Pro FPGAs for a software-defined radio receiver system.

by Dan McLane
Vice President
Innovative Integration Inc.
[email protected]

Amit Mane
Senior Engineer
Innovative Integration Inc.
[email protected]

Pingli Kao
Senior Engineer
Innovative Integration Inc.
[email protected]

Innovative Integration (II) Inc. has developed a powerful solution for multi-channel software-defined radio (SDR). The II digital radio receiver (DRR) has a wide-bandwidth IF front-end digitizer integrated with Xilinx® Virtex™-II Pro FPGAs. This configuration allows you to realize the full flexibility of scalable programmable design with high-performance signal processing hardware.


You can customize and optimize the II DRR for multiple applications using the II SDR reference design. The reference design incorporates a MATLAB Simulink environment with Xilinx System Generator for DSP. In the System Generator tool, data is bit- and cycle-true and reflects the performance of the real system. You can easily modify the characteristics of the system by changing the parameters in the blocksets. You can then verify the blocksets in real-time simulations. Four Texas Instruments (TI) 1 GHz TMS320C6416 DSPs supply ample calculating power for advanced operations such as signal encoding, decoding, and compression.

Anatomy of a Digital Down Converter
In the II DRR system, a digital down converter (DDC) decimates the RF to IF signal, providing signal compensation and shaping. The DDC comprises a cascaded integrator-comb (CIC) filter, a compensation filter (CFIR), and a programmable filter (PFIR).

The CIC filter is useful to realize large sample rate changes in digital systems. A CIC filter is a "multiplier-less" structure that comprises only adders, subtractors, and registers, which facilitates hardware implementation. The compensation filter flattens the passband frequency response. The programmable low-pass filter lowers the magnitude of the ripples produced by the CIC filter.

Figure 1 shows the entire SDR signal processing for the DDC. The channel filtering is built in the MATLAB Simulink environment using the SDR blockset. The input signal is tuned to the target frequency using a mixer and a direct digital synthesizer (DDS). The signal then flows through the CIC, CFIR, and PFIR.

You can implement the SDR directly in hardware by using the Xilinx System Generator tool to generate the signal processing logic that is fit into the Virtex-II Pro FPGA using Xilinx ISE™ software. The whole system can be designed and downloaded to hardware in hours, which effectively shortens time to market.

Designing Channel Filters
Using MATLAB's Filter Design and Analysis tool (FDATool) and Innovative's FrameWork Logic software, you can easily design and optimize your desired filters.

Consider a GSM system where the filter specification is:

Fs/2 = 32.5 MHz, Fpass = 0.49 MHz, Fstop1 = 0.542 MHz, Fstop2 = 1.35 MHz

The sampling frequency is 130 MHz and the decimation factor is 120. Figure 2 shows the theoretical system response from 0 Hz to 65 MHz.

When the specified filter is simulated, the passband ripple is within -0.8 dB, as shown in Figure 3a. Likewise, the magnitude is down to -40 dB at 0.544 MHz and below -90 dB after 1.365 MHz, as shown in Figure 3b.

The channel filter design verified in the simulation is bit- and cycle-true – so the DDC design matches the theoretical expected response shown in Figure 2.
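To make the CIC structure concrete, the following minimal C++ sketch models an N-stage integrator-comb decimator operating on integer samples. It is only a behavioral illustration of the adder/subtractor/register arithmetic described above; the stage count, word width, and decimation factor are arbitrary choices, not the FrameWork Logic or System Generator implementation.

```cpp
#include <cstdint>
#include <vector>

// Minimal N-stage CIC decimator: integrators run at the input rate,
// the stream is decimated by R, then combs run at the output rate.
// Only adders, subtractors, and registers are needed -- no multipliers.
class CicDecimator {
public:
    CicDecimator(unsigned stages, unsigned decimation)
        : integ_(stages, 0), comb_(stages, 0), R_(decimation), phase_(0) {}

    // Push one input sample; returns true and writes *out once every R samples.
    bool push(int64_t x, int64_t* out) {
        for (auto& acc : integ_) {        // cascaded integrators
            acc += x;
            x = acc;
        }
        if (++phase_ < R_) return false;  // decimate: keep 1 of every R samples
        phase_ = 0;
        for (auto& prev : comb_) {        // cascaded combs (differentiators)
            int64_t y = x - prev;
            prev = x;
            x = y;
        }
        *out = x;                         // DC gain is R^stages; scale downstream
        return true;
    }

private:
    std::vector<int64_t> integ_, comb_;
    unsigned R_, phase_;
};
```

With the 130 MHz input rate and overall decimation of 120 quoted above, a CIC of this form would typically carry most of the rate change, leaving the CFIR to correct passband droop and the PFIR to set the final channel shape.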


Figure 1 – Simulink block diagram of DDC using Xilinx System Generator software

Figure 2 – Theoretical frequency response of DDC of R = 120 from FDATool program

Figure 3(a) – Frequency response at the corner of Fpass = 0.490 MHz

Figure 3(b) – Frequency response within 0.5 MHz



II DRR Implemented in Hardware
The II DRR system comprises one Quadia DSP card and two ultra-wideband (UWB) PMC I/O modules that can perform digital down conversions and channel filtering on 40 channels simultaneously. The high-speed, front-end signal processing for the DRR is implemented in Virtex-II Pro XC2VP40 FPGAs on the Quadia card and UWB modules, as shown in Figures 4 and 5.

The Quadia card has four TI 1 GHz TMS320C6416 DSPs that are used for baseband processing. The DRR has 40 independently tuned channels that deliver captured in-phase and quadrature (IQ) data to the DSPs. The FPGAs implement 16 MB data queues for each channel with flow control to the DSPs. This configuration allows the DRR to efficiently process the system's large data rates.

The II DRR system was tested to show channel filter response, system frequency response, and signal quality – and demonstrated excellent results. The results closely matched the theoretical performance predicted by the MATLAB simulations. Each output channel has a tuning resolution of 10 kHz and a noise floor of around -90 dB. The system dynamic range was shown to be greater than 80 dB.

Customize Your Applications
More than 40 percent of the Virtex-II Pro logic blocks and four high-throughput DSP chips are available for your custom applications. The software provided for the II DRR system supports the complete configuration of your system. The software also supports using the DSPs on the Quadia board for channel baseband processing.

The DRR system supports either a single pair of DSPs or a full configuration with four DSPs and two UWB modules. It also supports logic and DSP program downloads to all of the devices in the system through a host PCI. This allows dynamic reconfiguration of the system for multi-protocol applications.

C++ development libraries on both the host and target DSPs are provided to perform hardware and data-flow management. Most system operations – including data transfer from target to host, loading system logic, and DSP common object file format (COFF) file downloading – can be performed with a single function call using Innovative's Malibu host library. DSP functions, including device drivers for the DRR interface and controls, are included in Innovative's Pismo library and run under TI's DSP/BIOS RTOS.

An advantage of the Malibu library is that the code to control the baseboards is standard C++ code, and the library is portable between different compilers. The Malibu library supports Borland C++ Builder 6 and Microsoft Visual Studio 2003.

The configuration software allows the setup of each channel on the DRR. You can configure each channel to have its own A/D input, tuning frequency, gain, and spectral inversion. Tuning frequencies are saved relative to a base frequency. This base frequency can be measured in calibration and loaded into the program to allow selection of precise tuning frequencies. All configurations and settings may be saved and reloaded for convenience.
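As a rough illustration of the per-channel settings and the base-plus-offset tuning scheme just described, here is a small C++ sketch. The structure and field names are hypothetical, chosen only to mirror the options listed above; they do not reflect the actual II configuration software or Malibu API.

```cpp
#include <vector>

// Hypothetical per-channel settings mirroring the options described in the
// article: A/D input selection, tuning frequency, gain, and spectral inversion.
struct ChannelConfig {
    int    adcInput       = 0;     // which A/D feeds this channel
    double tuningOffsetHz = 0.0;   // stored relative to a calibrated base frequency
    double gainDb         = 0.0;
    bool   spectralInvert = false;
};

struct DrrConfig {
    double baseFrequencyHz = 0.0;  // measured during calibration, then loaded
    std::vector<ChannelConfig> channels = std::vector<ChannelConfig>(40);  // 40 channels

    // Absolute tuning frequency that would be programmed into a channel's DDS/mixer.
    double absoluteTuningHz(int ch) const {
        return baseFrequencyHz + channels[ch].tuningOffsetHz;
    }
};
```

Saving offsets rather than absolute frequencies means a re-measured base frequency corrects every channel at once, which is the convenience the article attributes to the calibration step.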

Once the configuration is set, you can download selected target programs to the designated DSPs. A global trigger to the system for all channels begins data flow in the system. The DSPs continuously process data from the DRR and deliver the data to the host.

Conclusion
Innovative Integration's digital radio receiver system is a practical multi-channel SDR application that takes you step-by-step from requirements to implementation. Using MATLAB Simulink DSP tools to define specific digital signal processing, you can directly – and quickly – implement your SDR model into Xilinx Virtex-II Pro FPGAs and TI DSPs using Xilinx System Generator and ISE software. The final hardware is realized on Quadia DSP cards and UWB PMC modules using Innovative's comprehensive FrameWork Logic system and DSP software support.

Everything you need to produce a working SDR application based on the II DRR system model is available through the Innovative Integration website at www.innovative-dsp.com.


Figure 4 – Quadia DSP/FPGA card with dual Xilinx FPGAs and four TI DSPs

Figure 5 – UWB PMC module with dual 250 MSPS A/Ds


Now, There’s A Flow You Could Get Used To.

©2006 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. MATLAB and Simulink are registered trademarks of The MathWorks, Inc. All other trademarks are the property of their respective owners.

www.xilinx.com/dsp

Sidestep tedious hardware development tasks with Xilinx System Generator for DSP and AccelDSP design tools. When combined with our library of algorithmic IP and hardware platforms, Xilinx XtremeDSP Solutions get you from design to verified silicon faster than ever.

Visit www.xilinx.com/dsp today and speed your design with AccelDSP and System Generator for DSP.


by Russ Nelson, Senior Design Engineer, Xilinx, Inc., [email protected]

Stacey Secatch, Staff Design Engineer, Xilinx, [email protected]

United Parcel Service and FedEx are arguably two of the most sophisticated package delivery services in the world, but they still do not build their own trucks. They know that their strategic advantage lies in figuring out how to move packages around the world efficiently.

Take a lesson from these two successful companies – let the new Xilinx® Packet Queue LogiCORE™ IP be the delivery vehicle for your packets so that you can focus on your strategic advantage instead of wasting time and money on the mechanics of packet buffering.

Packet Queue joins FIFO Generator and Block Memory Generator in an impressive portfolio of Xilinx no-cost memory cores. Packet Queue buffers packetized data, multiplexes one or more channels together, provides advanced features such as retransmission and discarding of unwanted or corrupted packets, and includes interfaces for complete scheduling control.

Using Xilinx CORE Generator™ software (as shown in Figure 1), you can configure and generate an optimized, pre-engineered solution to your protocol bridging, aggregation, or packet-buffering application. And because Packet Queue provides an optimized and supported solution at no cost, you will realize real savings in terms of NRE and time to market. In this article, we'll highlight Packet Queue's features and show how it can enhance two sample networking designs.

Channelized Data Transfer
Seldom does the output from a system look identical to the input. This is particularly the case in networking applications, where data is often encapsulated and re-encapsulated in different protocols as it travels through the network. Packet Queue is particularly suited for these systems. It supports as many as 32 input channels, transparently performing data-width conversion while simultaneously migrating data from each independent input clock domain to the output clock domain. Packet Queue's sideband data feature is also extremely versatile, enabling unusual data widths and additional control or status signaling.
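The data-width conversion described above can be pictured as a simple gearbox. The following C++ sketch packs eight 8-bit input beats into one 64-bit output word (the widths used in the Ethernet aggregation example later in this article); it is a conceptual model only, ignoring clock-domain crossing and packet boundaries, and does not represent the core's implementation.

```cpp
#include <cstdint>
#include <optional>

// Conceptual 8-bit to 64-bit width converter (gearbox): accumulate eight
// input beats, then emit one output word. Lane packing order (little-endian
// here) is an arbitrary choice for illustration.
class WidthConverter8to64 {
public:
    std::optional<uint64_t> push(uint8_t beat) {
        word_ |= static_cast<uint64_t>(beat) << (8 * count_);
        if (++count_ < 8) return std::nullopt;   // still filling the output word
        uint64_t out = word_;
        word_ = 0;
        count_ = 0;
        return out;                              // one 64-bit word per 8 input beats
    }
private:
    uint64_t word_ = 0;
    unsigned count_ = 0;
};
```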

Shared Memory Architecture
Packet Queue reduces your design cost, not only as a no-cost core but also by reducing FPGA resources versus traditional implementations. Packet buffering memory is segmented and allocated to packets dynamically as they are written to the core. This enables sharing of memory between data from all channels. You can therefore configure the overall size of the core to represent the peak requirements of the system, as opposed to the sum of the peak requirements of each channel.


Eliminate Packet Buffering Busywork
The Packet Queue IP core simplifies packet buffering and aggregation, letting you focus on high-level system design.


Overall memory needs will vary greatly from design to design. Packet Queue enables you to pick the solution that is right for your system's specifications, resulting in memory savings that translate directly to a lower unit cost for you.

Retransmit and Discard
If your application requires the ability to recover from receiving corrupted packets or to resend packets corrupted during transmission, Packet Queue provides simple yet powerful options. Packet Queue's input logic supports the discard of an incoming packet at any point before packet completion by asserting a single signal. This functionality can also be used to discard unwanted packets, such as those with invalid cyclic redundancy checks (CRCs). Similarly, you can interrupt a packet being transmitted with a single signal, enabling retransmission at a later time. Together, these two features allow Packet Queue to easily interface with unreliable links.
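The discard and retransmit behavior can be captured with a small behavioral model. The C++ below is illustrative only – the names are hypothetical and do not correspond to Packet Queue's actual port list – but it mirrors the two mechanisms described above: dropping an in-progress packet with a single call, and holding transmitted packets until they are acknowledged so they can be resent.

```cpp
#include <cstdint>
#include <deque>
#include <vector>

using Packet = std::vector<uint8_t>;

// Behavioral model of one buffered channel: discard an in-progress packet
// (for example, on a bad CRC) and keep sent packets until acknowledged so
// an interrupted transmission can simply be replayed later.
class BufferedChannel {
public:
    void writeBeat(uint8_t b) { inProgress_.push_back(b); }
    void discardInProgress()  { inProgress_.clear(); }             // single "discard" action
    void commitPacket()       { queue_.push_back(std::move(inProgress_)); inProgress_.clear(); }

    // Transmit side: the head packet stays queued until acknowledged, so a
    // retransmit request simply presents the same packet again.
    const Packet* headForTransmit() const { return queue_.empty() ? nullptr : &queue_.front(); }
    void acknowledgeHead()    { if (!queue_.empty()) queue_.pop_front(); }  // frees its memory

private:
    Packet inProgress_;
    std::deque<Packet> queue_;
};
```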

Complete Scheduling Control
The most powerful feature of Packet Queue isn't something included in the core – it's something not included in the core. As a designer of a networking system, your competitive advantage lies in your scheduling algorithms. Packet Queue provides status and control ports for directing multiplexing between data channels, letting you decide how to transfer data based on the requirements of your system – whether that means round robin, weighted round robin, fixed priority, or anything else you can design. For example, you can choose to give priority to channels that require a higher quality of service (QoS) to reduce queuing latency.

Aggregation Example
Our first example application is that of an Ethernet bridge between ten 1-Gb ports and a single 10-Gb port, as shown in Figure 2. Packet Queue can assist you with aggregation. Several of its features come into play here:

• Multiple channels – this configuration uses 10 of Packet Queue's 32 available input channels. You can also configure more input ports for oversubscription of the 10-Gb link.

• Clock conversion – the Xilinx 1-Gb Ethernet MAC LogiCORE solution (GEMAC) provides data at a frequency of 125 MHz. The Xilinx 10-Gb Ethernet MAC LogiCORE solution (10GEMAC) expects data at 156.25 MHz. Packet Queue seamlessly resynchronizes data from one clock domain to the other, allowing each input GEMAC core to exist in its own clock domain.

• Width conversion – the GEMAC core provides 8-bit data, while the 10GEMAC core expects 64-bit data. Packet Queue automatically converts between the different widths.

• Scheduling control – Packet Queue's versatile scheduling interfaces enable you to implement any algorithm you desire to control data flow through the core. A simple system might use a round-robin scheme, while a more complex implementation could include QoS, giving priority to particular ports based on the type of data expected (a sketch of such scheduling logic follows this list).

• Shared memory architecture – the dynamic memory allocation used by Packet Queue is particularly suited to Ethernet because it supports variable frame sizes, including jumbo frames. Because memory is shared between channels, the total memory needed is reduced, while still handling bursty data. Simply determine the peak outstanding data and apply it to the shared memory size instead of the size of each channel. For example, the core as configured in the GUI shown in Figure 1 supports as many as three stored jumbo frames at any given time, in addition to smaller standard frames.

Figure 1 – Packet Queue GUI showing potential configuration for Ethernet example

Figure 2 – Packet Queue aggregation of 10 Ethernet channels
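As noted in the scheduling-control item above, the scheduling algorithm lives in your own logic outside the core. The following C++ sketch shows one simple possibility – strict priority for designated QoS channels, round-robin otherwise – with a hypothetical interface; it does not represent the core's actual status and control ports.

```cpp
#include <cstddef>
#include <vector>

// User scheduling logic sketch: pick the next channel to drain. Channels
// flagged high-priority are served first (simple QoS); otherwise service
// is round-robin over channels that currently have a packet ready.
class Scheduler {
public:
    explicit Scheduler(std::size_t numChannels)
        : highPriority_(numChannels, false), next_(0), n_(numChannels) {}

    void setHighPriority(std::size_t ch, bool hp) { highPriority_[ch] = hp; }

    // 'ready' has one flag per channel; returns the channel to service,
    // or -1 if nothing is ready this cycle.
    int pick(const std::vector<bool>& ready) {
        for (std::size_t ch = 0; ch < n_; ++ch)              // strict-priority pass
            if (highPriority_[ch] && ready[ch]) return static_cast<int>(ch);
        for (std::size_t i = 0; i < n_; ++i) {               // round-robin pass
            std::size_t ch = (next_ + i) % n_;
            if (ready[ch]) { next_ = (ch + 1) % n_; return static_cast<int>(ch); }
        }
        return -1;
    }

private:
    std::vector<bool> highPriority_;
    std::size_t next_, n_;
};
```

A weighted round-robin or per-port byte-credit scheme would slot into the same pick() decision without touching the buffering itself, which is the separation of concerns the article describes.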

Buffering Example
Our second example application, shown in Figure 3, uses Packet Queue to provide transmit buffering and retry capabilities for the Xilinx Serial RapidIO LogiCORE solution (SRIO). The RapidIO protocol allows for as many as 31 outstanding (unacknowledged) packets, with the possibility of retransmitting any or all of the outstanding packets. Packet Queue is ideally suited to provide this amount of buffering, as well as providing other features to make the integration as simple and seamless as possible. These features include:

• LocalLink interfaces – both Packet Queue and SRIO use the Xilinx LocalLink standard for their data interfaces. This enables you to connect the two cores' data interfaces together without any glue logic. No glue logic means faster development and less time spent debugging.

• Retransmit packets – Packet Queue's retransmit capability meshes quite well with the requirements of RapidIO. The SRIO core provides two interfaces through which packet retransmit requests are made. When there is a problem with the packet currently on the transmit interface, SRIO asserts its "discontinue" signal and the packet is retransmitted by the Packet Queue without any further intervention by your control logic. When retransmission of more than one packet is required, your control logic then translates the request for all of the failed packets currently located in the Packet Queue and schedules them for resend as normal. Following packet acknowledgment, your logic then instructs Packet Queue to free the memory that was allocated.

• Sideband data and channelization – RapidIO optionally allows packets with different priorities. Packet Queue supports this in either of two ways, giving you the flexibility to pick the implementation that best suits your system requirements. A simpler way is to configure the core with a single bit of sideband data that is mapped to the SRIO core's critical request input. Doing so causes the critical request flag to pass through the Packet Queue along with the associated packet.

The more complex (but more powerful) method for implementing priorities is to use multiple Packet Queue channels, with prioritized and critical request packets written to separate channels. This strategy allows higher priority packets to pass lower priority packets in the Packet Queue, but requires you to implement more complex scheduling and retransmit control logic.

Conclusion
The Xilinx Packet Queue LogiCORE IP is a simple yet powerful solution for protocol bridging, aggregation, and packet buffering systems. In addition to the examples we've discussed, Packet Queue is great for bridging SPI-3 to SPI-4.2, SPI-4.2 to XAUI, PCI to PCI Express, and countless other protocol combinations. As networking protocols multiply and networks carry more diverse types of data, the need for these types of systems will only grow. Packet Queue provides a cost-effective solution for your designs and lets you deliver your product to your customers earlier.

For more information about the Packet Queue, visit www.xilinx.com/systemio/packet_queue. Feedback and suggestions for new features are always welcome. To contact the Packet Queue team, e-mail [email protected].



Figure 3 – Serial RapidIO transmit buffering with optional priority reordering



Processing and FPGA - Input/Output - Data Recording - Bus Analyzers

Stay Ahead!

with the PMC-FPGAO5 Range of Virtex-5 PMCs

PCI based development system also available

Xilinx Virtex-5 LX FPGA: A new generation of performance

Analog I/O, CameraLink, LVDS, FPDP-II & RS422/485 options: Fast, integrated I/O without bottlenecks

Multiple banks of fast memory: DSP & I/O optimized memory architecture

PCI-X Interface with multiple DMA controllers: More than 1 GB/s bandwidth to host

Libraries and Example Code: Easy to use with head-start time-to-market

For more information, please visit virtex5.vmetro.com or call (281) 584 0728


Xilinx Events
Xilinx participates in numerous trade shows and events throughout the year. This is a perfect opportunity to meet our silicon and software experts, ask questions, see demonstrations of new products, and hear other customer success stories.

North America
Aug. 8-10 NI Week 2006, Austin, TX
Aug. 16 Next-Gen Networking Chips, San Jose, CA
Sept. 26-28 MAPLD 2006, Washington, DC
Oct. 16-18 Convergence 2006, Detroit, MI
Oct. 23-25 MILCOM 2006, Washington, DC

Europe
Sept. 27-29 RADECS, Athens, Greece
Sept. 7-13 IBC, Amsterdam, Holland

Japan
Sept. 28 Hi Speed Interface Workshop, Tokyo, Japan

For more information and the current schedule, visit www.xilinx.com/events/.

Xilinx Virtex™-4 FPGAs – Product Selection Matrix
www.xilinx.com/devices/

The matrix covers the Virtex-4 LX (logic), SX (signal processing), and FX (embedded processing and serial connectivity) platforms – XC4VLX15 through XC4VLX200, XC4VSX25/35/55, and XC4VFX12 through XC4VFX140 – and lists for each device: CLB array size, CLB flip-flops, slices, logic cells, maximum distributed RAM, block RAM/FIFO with ECC (18 kbits each), total block RAM, digital clock managers (DCM), phase-matched clock dividers (PMCD), maximum SelectIO™ and differential I/O pairs, supported I/O standards (LVDS, BLVDS, ULVDS, LVPECL, LVCMOS, LVTTL, PCI/PCI-X, GTL/GTL+, HSTL I-IV, and SSTL2/SSTL18 classes, among others), XtremeDSP™ slices, PowerPC™ processor blocks, 10/100/1000 Ethernet MAC blocks, RocketIO™ serial transceivers, commercial and industrial speed grades, configuration memory bits, package options, and the matching EasyPath™ cost-reduction devices (XCE4VLX25 through XCE4VFX140).

Notes: 1. SFA packages (SF): flip-chip fine-pitch BGA (0.80 mm ball spacing); FFA packages (FF): flip-chip fine-pitch BGA (1.00 mm ball spacing). All Virtex-4 LX and Virtex-4 SX devices available in the same package are footprint-compatible. 2. MGT: RocketIO Multi-Gigabit Transceivers. 3. The MGT column gives the number of available RocketIO Multi-Gigabit Transceivers. 4. EasyPath solutions provide a conversion-free path for volume production. Pb-free solutions are available; for more information about Pb-free solutions, visit www.xilinx.com/pbfree.

Important: Verify all data in this document with the device data sheets found at http://www.xilinx.com/partinfo/databook.htm

Xilinx Spartan™-3 FPGAs and Platform Flash – Product Selection Matrix
www.xilinx.com/devices/

The matrix covers the Spartan-3 and Spartan-3L families (1.2 Volt) – XC3S50 through XC3S5000, plus the reduced-quiescent-power XC3S1000L, XC3S1500L, and XC3S4000L – and the Spartan-3E family (1.2 Volt) – XC3S100E through XC3S1600E. For each device it lists: system gates, CLB array, number of slices, equivalent logic cells, CLB flip-flops, maximum distributed RAM, block RAM, dedicated multipliers, number of DCMs and DCM frequency range (24/280 MHz), digitally controlled impedance, differential I/O pairs, maximum I/O, supported single-ended and differential I/O standards, commercial and industrial speed grades, configuration memory bits, and EasyPath availability, along with package options and user I/O for PQFP, VQFP, TQFP, chip-scale, standard BGA, and fine-pitch BGA packages.

Platform Flash features: XCF01S/XCF02S/XCF04S (1/2/4 Mb, 3.3V VCC) and XCF08P/XCF16P/XCF32P (8/16/32 Mb, 1.8V VCC) configuration PROMs, with feature rows for JTAG programming, serial configuration, SelectMap configuration, compression, design revisions, VCCO and VCCJ ranges, clock rate (33 or 40 MHz), VO20/VO48/FS48 packages and their Pb-free equivalents; all devices are available now.

Notes: 1. System Gates include 20-30% of CLBs used as RAMs. 2. Spartan-3L devices offer reduced quiescent power consumption; package offerings may vary slightly from those offered in the Spartan-3 family (see the Package Selection Matrix for details). Numbers in the package table indicate the maximum number of user I/Os, and area dimensions for lead-frame products are inclusive of the leads. Pb-free solutions are available; for more information about Pb-free solutions visit www.xilinx.com/pbfree.

Important: Verify all data in this document with the device data sheets found at http://www.xilinx.com/partinfo/databook.htm

For the latest information and product specifications on all Xilinx products, please visit the following links: FPGA and CPLD Devices (www.xilinx.com/devices/), Configuration and Storage Systems (www.xilinx.com/configsolns/), Packaging (www.xilinx.com/packaging/), Software (www.xilinx.com/ise/), Development Reference Boards (www.xilinx.com/board_search/), IP Reference (www.xilinx.com/ipcenter/), Platform Flash (www.xilinx.com/products/silicon_solutions/proms/pfp/), Global Services (www.xilinx.com/support/gsd/).

Xilinx CPLD – Product Selection Matrix – CoolRunner™ Series
www.xilinx.com/devices/

The matrix covers the CoolRunner-II family (1.8 Volt) – XC2C32A, XC2C64A, XC2C128, XC2C256, XC2C384, and XC2C512 – and the CoolRunner XPLA3 family (3.3 Volt) – XCR3032XL, XCR3064XL, XCR3128XL, XCR3256XL, XCR3384XL, and XCR3512XL. For each device it lists: system gates, macrocells, product terms per macrocell, input and output voltage compatibility, maximum I/O, I/O banking, minimum pin-to-pin logic delay, commercial, industrial, and IQ speed grades, global clocks, and product-term clocks per function block, along with package options and user I/O for QFN, PLCC, VQFP, TQFP, PQFP, chip-scale, and fine-pitch BGA packages.

Notes: JTAG pins and port enable are not pin compatible in the indicated package for that member of the family. Note 1: Area dimensions for lead-frame products are inclusive of the leads.

Xilinx CPLD – Product Selection Matrix – 9500 Series
www.xilinx.com/devices/

The matrix covers the XC9500 family (5 Volt: XC9536 through XC95288), the XC9500XL family (3.3 Volt: XC9536XL through XC95288XL), and the XC9500XV family (2.5 Volt: XC9536XV through XC95288XV). For each device it lists: system gates, macrocells, product terms per macrocell, input and output voltage compatibility, maximum I/O, I/O banking, minimum pin-to-pin logic delay, commercial, industrial, and IQ speed grades, global clocks, and product-term clocks per function block, along with package options and user I/O for PLCC, VQFP, TQFP, PQFP, standard BGA, chip-scale, and fine-pitch BGA packages.

Note 1: Area dimensions for lead-frame products are inclusive of the leads. Pb-free solutions are available; for more information about Pb-free solutions visit www.xilinx.com/pbfree.

Xilinx Software – ISE 8.2i
www.xilinx.com/ise
ISE™ 8.2i – Powered by ISE Fmax Technology

The matrix compares the two ISE configurations – ISE WebPACK™ and ISE Foundation™ – by device support and by feature. ISE WebPACK supports the smaller members of the Virtex series (Virtex XCV50 - XCV600, Virtex-E XCV50E - XCV600E, Virtex-II XC2V40 - XC2V500, Virtex-II Pro XC2VP2 - XC2VP7, Virtex-4 LX15/LX25, SX25, and FX12, and the corresponding Virtex Q/QR and Virtex-E Q parts), all Spartan-II/IIE and Spartan-3E devices, Spartan-3 XC3S50 - XC3S1500, Spartan-3L, XA (Xilinx Automotive) Spartan-3, and all CoolRunner™ XPLA3, CoolRunner-II/IIA, and XC9500™ series CPLDs. ISE Foundation supports all members of these families plus Virtex-5 LX (XC5VLX30 - XC5VLX220).

Feature rows cover design entry (HDL and schematic editors, state diagram editor, Xilinx CORE Generator™ System, RTL and technology viewers, architecture wizards, third-party RTL checker support, Xilinx System Generator for DSP, and the Embedded Design Kit (EDK), with some items sold as options), synthesis (XST – Xilinx Synthesis Technology – plus Mentor Graphics® Leonardo Spectrum™, Precision RTL™, Precision Physical™, Synplify/Pro/Premier, and ABEL for CPLDs), supported platforms (Microsoft Windows 2000/XP and Red Hat Enterprise Linux 3/4 for WebPACK, with Sun Solaris 2.8/2.9 and 64-bit Linux added for Foundation), implementation (FloorPlanner, PlanAhead™ sold as an option with a free downloadable evaluation at www.xilinx.com/planahead, timing-driven place and route, partition support, incremental design, timing improvement wizard, Xplorer), programming (iMPACT, System ACE™, CableServer), board-level integration (IBIS/STAMP/HSPICE models and ELDO models for MGTs), and verification (ChipScope™ Pro and the ChipScope Pro Serial IO Toolkit with a free downloadable evaluation at www.xilinx.com/chipscopepro, ISE Simulator Lite and ISE Simulator, ModelSim XE III Starter and XE III, static timing analyzer, FPGA Editor with probe, ChipViewer, XPower power analysis, SMARTmodels for PowerPC™ and RocketIO™, and third-party simulator and equivalency-checking support).

HSPICE and ELDO models are available at the Xilinx Design Tools Center at www.xilinx.com/ise. For more information, visit www.xilinx.com/ise.


New High Speed ASIC Prototyping Engine - Now shipping!
• Logic prototyping system with 2-16 Xilinx Virtex™-4 FPGAs
  - 14 LX100/160/200 and 2 FX60/100
  - Nearly 24M ASIC gates (LSI measure - not inflated)
• FPGA interconnect is single-ended or LVDS (350 MHz & 700 Mbit/s)
• Synplicity Certify™ models for partitioning assistance
• 4 separate DDR2 SODIMMs (200 MHz)
  - 64-bit data width, 200 MHz (400 Mbit/s), 4 GB each
  - Optional modules: Mictor, SSRAM, RLDRAM, FLASH
• Multi-gigabit serial I/O interfaces via XFP, SFP, and SMAs
• Flexible customization via daughter cards
  - 1,200 I/O pins, LVDS capable, 350 MHz
• Fast and painless FPGA configuration via Compact FLASH or USB
• Full support for embedded logic analyzers via JTAG interface: ChipScope™, Identify™

The DINI Group, 1010 Pearl Street, Suite 6, La Jolla, CA 92037, (858) 454-3419, www.dinigroup.com

DN8000k10: 16 Virtex-4 FPGAs, 24 million ASIC gates. Try our other Virtex-4 products.


Instant 10A Power Supply

Complete, Quick & Ready.

LTC, LT, LTM and PolyPhase are registered trademarks and µModule is a trademark of Linear Technology Corporation. All other trademarks are the property of their respective owners.

The LTM®4600 is a complete 10A switchmode step-down power supply with a built-in inductor, supporting power components and compensation circuitry. With high integration and synchronous current mode operation, this DC/DC µModule™ delivers high power at high efficiency in a tiny, low profile surface mount package. Supported by Linear Technology's rigorous testing and high reliability processes, the LTM4600 simplifies the design and layout of your next power supply.

Ultrafast Transient Response: 2% VOUT deviation with a 5A load step
(VIN = 12V, VOUT = 1.5V, 0A to 5A load step; COUT = 3 x 22µF ceramics, 470µF POS cap)

Features
• 15mm x 15mm x 2.8mm LGA with 15°C/W θJA
• Pb-Free (e4), RoHS Compliant
• Only CBULK Required
• Standard and High Voltage: LTM4600EV: 4.5V ≤ VIN ≤ 20V; LTM4600HVEV: 4.5V ≤ VIN ≤ 28V
• 0.6V ≤ VOUT ≤ 5V
• IOUT: 10A DC, 14A Peak
• Parallel Two µModules for 20A Output

Nu Horizons Electronics Corp.

Tel: 1-888-747-NUHO

www.nuhorizons.com/linear

Information

FREE LTM4600 µModule Board


High VELOCITY LEARNING

Nu Horizons Electronics Corp. is proud to present our newest education and training program - XpressTrack - which offers engineers the opportunity to participate in technical seminars conducted around the country by experts focused on the latest technologies from Xilinx. This program provides higher velocity learning to help minimize start-up time to quickly begin your design process utilizing the latest development tools, software and products from both Nu Horizons and Xilinx.

For a complete list of course offerings, or to register for a seminar near you, please visit:

www.nuhorizons.com/xpresstrack

Courses Available

Optimizing MicroBlaze Soft Processor Systems
• 4 hour class
• Covers building a complete customized MicroBlaze soft processor system

Video/Imaging Algorithms in FPGAs
• 1 day class
• Verify designs onto actual hardware using Ethernet-based hardware-in-the-loop co-simulation

Introduction to Embedded PowerPC and Co-Processor Code Accelerators
• 4 hour class
• Covers how to build a high performance embedded PowerPC system

Xilinx Spotlight: Virtex 5 Architectural Overview
• Learn about the new 65 nm family of FPGAs from Xilinx
• Understand how the new ExpressFabric will improve logic utilization

Fundamentals of FPGA
• 1 day class
• Covers ISE 8.1 features

ISE Design Entry
• 1 day class
• Covers XST, ECS, StateCAD and ISE simulator

Fundamentals of CPLD Design
• 1 day class
• Covers CPLD basics and interpreting reports for optimum performance

Design Techniques for Low Cost
• 1 day class
• Covers developing low cost products, particularly in high volume markets

VHDL for Design Engineers
• 1 day class
• Covers VHDL language and implementation for FPGAs & CPLDs


The Programmable Logic CompanySM

www.xilinx.com/virtex5

Achieve highest system speed and better design margin with the world's first 65nm FPGAs.

Virtex™-5 FPGAs feature ExpressFabric™ technology on 65nm triple-oxide process. This new fabric offers the industry's first LUT with six independent inputs for fewer logic levels, and advanced diagonal interconnect to enable the shortest, fastest routing. Now you can achieve 30% higher performance, while reducing dynamic power by 35% and area by 45% compared to previous generations.

Design systems faster than ever before

Shipping now, Virtex-5 LX is the first of four platforms optimized for logic, DSP, processing, and serial. The LX platform offers 330,000 logic cells and 1,200 user I/Os, plus hardened 550 MHz IP blocks. Build deeper FIFOs with 36 Kbit block RAMs. Achieve 1.25 Gbps on all I/Os without restrictions, and make reliable memory interfacing easier with enhanced ChipSync™ technology. Solve SI challenges and simplify PCB layout with our sparse chevron packaging. And enable greater DSP precision and dynamic range with 550 MHz, 25x18 MACs.

Visit www.xilinx.com/virtex5, view the TechOnline webcast, and give your next design the ultimate in performance.

The Ultimate System Integration Platform

Ultimate Performance: bar chart comparing Virtex-5 FPGAs, Virtex-4 FPGAs, and the nearest competitor across logic fabric performance, on-chip RAM (550 MHz), DSP 32-tap filter (550 MHz), I/O LVDS bandwidth (750 Gbps), and I/O memory bandwidth (384 Gbps), showing Virtex-5 advantages of 1.3x to 4.4x over the nearest competitor (the industry's fastest 90nm FPGA benchmark). Numbers show comparison with nearest competitor, based on competitor's published datasheet numbers.

©2006 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners. PN 0010956