Xcell Journal Issue 73

Issue 73Fourth Quarter 2010

Xcell journalXcell journalS O L U T I O N S F O R A P R O G R A M M A B L E W O R L DS O L U T I O N S F O R A P R O G R A M M A B L E W O R L D

INSIDE

FPGA Partial ReconfigurationGoes Mainstream

Resurrecting the Mighty Cray on a Spartan-3 FPGA

How Hardware Accelerators Can Speed Your Processor’s Sine Calculations

Xilinx Releases 2009 Quality Report

INSIDE

FPGA Partial ReconfigurationGoes Mainstream

Resurrecting the Mighty Cray on a Spartan-3 FPGA

How Hardware Accelerators Can Speed Your Processor’s Sine Calculations

Xilinx Releases 2009 Quality Report

www.xilinx.com/xcell/

FPGAs Enable Greener Future

for Industrial Motor Control

FPGAs Enable Greener Future

for Industrial Motor Control

©Avnet, Inc. 2010. All rights reserved. AVNET is a registered trademark of Avnet, Inc.

Introducing the First-ever Battery-powered Xilinx FPGA Development BoardThe low-cost Xilinx® Spartan®-6 FPGA LX16 Evaluation

Kit, designed by Avnet, combines a Spartan-6 LX16

FPGA with a Cypress PSoC® 3 controller and LPDRAM

memory. This kit demonstrates a versatile battery-

powered solution for low-power applications and

provides a sound development environment for the

most demanding applications. The baseboard can also

be expanded with new FPGA Mezzanine Card (FMC)

modules, making design porting and FMC swapping

nearly seamless between Avnet and Xilinx platforms.

FMC modules for ISM Networking, Dual Image Sensing

and DVI I/O are currently available from Avnet.

Spartan®-6 FPGA LX16Evaluation Kit Includes:

Spartan-6 LX16 DSP FPGA LCD add-on panel USB A-mini B cable ISE® WebPACK™ DVD AvProg configuration & programming utility Downloadable documentation& reference design

To purchase this kit, visit www.em.avnet.com/spartan6lx-evl or call 800.332.8638.

Toggle among banks of internal signals for incremental real-time internal measurements without:

See how you can save time by downloading our free application note.

www.agilent.com/find/fpga_app_note

Quickly see inside your FPGA

u.s. 1-800-829-4444 canada 1-877-894-4414© Agilent Technologies, Inc. 2009

Also works with all InfiniiVisionand Infiniium MSO models.

L E T T E R F R O M T H E P U B L I S H E R

Xilinx, Inc.2100 Logic DriveSan Jose, CA 95124-3400Phone: 408-559-7778FAX: 408-879-4780www.xilinx.com/xcell/

© 2010 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands includedherein are trademarks of Xilinx, Inc. All other trade-marks are the property of their respective owners.

The articles, information, and other materials includedin this issue are provided solely for the convenience ofour readers. Xilinx makes no warranties, express,implied, statutory, or otherwise, and accepts no liabilitywith respect to any such articles, information, or othermaterials or their use, and any use thereof is solely atthe risk of the user. Any person or entity using suchinformation in any way releases and waives any claim itmight have against Xilinx for any loss, damage, orexpense caused thereby.

PUBLISHER Mike [email protected]

EDITOR Jacqueline Damian

ART DIRECTOR Scott Blair

DESIGN/PRODUCTION Teie, Gelwicks & Associates1-800-493-5551

ADVERTISING SALES Dan [email protected]

INTERNATIONAL Melissa Zhang, Asia [email protected]

Christelle Moraga, Europe/Middle East/[email protected]

Miyuki Takegoshi, [email protected]

REPRINT ORDERS 1-800-493-5551

Xcell journal

www.xilinx.com/xcell/

Xilinx Demonstrates Quality Commitment to Customersfew years ago, I wrote a cover story for EDN magazine on the subject of reliability in elec-tronic design. In the process of researching the topic, I learned many interesting things—first of which is that most companies don’t want to speak publicly about the quality of

their products. Most, not all.That original article was inspired by a shopping experience. While ringing up an Xbox 360 that

I was purchasing for the kids, the salesperson strongly suggested I also buy an aftermarket coolingunit that plugged into the console to keep it from getting too hot. At the time it seemed absurd thatafter forking over hundreds of dollars for a videogame system I’d need to pony up extra money tomake up for the possibility that there were deficiencies in the Xbox 360’s design.

Needless to say, I didn’t buy the cooling gizmo, and a year later (and a year after I wrote thereliability article) my Xbox 360 experienced what many Xbox 360 owners collectively called “thered ring of death.” Microsoft subsequently fixed the defect (rumored to be a thermal issue withan ASIC) for free for all its customers. The companymust have forked over untold amounts of cash to respinthe ASIC, set up the infrastructure to refurbish units,cover postage and organize the training for customerservice folks to say “I’ve never heard of the term ‘redring of death’” with some attempt at sincerity. Giventhe quasi recall, one has to wonder how much moresuccessful the Xbox would have been had Microsoft notencountered the quality issue.

Quality is really something everyone in electronicsshould be more mindful of, especially as IC siliconprocesses continue to shrink and can accommodate ever-more-elaborate designs that power a broader range ofapplications, spanning from consumer electronics tomedical and mission-critical systems.

So given this background, I was very proud to see ourquality team establish one of the few corporate qualityreports available for public consumption. For many years,our customers have recognized Xilinx for its outstandingquality. The 2009 quality report—downloadable in PDFform at https://docs.google.com/viewer?url=http://www.xilinx.com/publications/prod_mktg/2009-quality-annual-report.pdf—highlights Xilinx’s continued improvements in quality and customer satisfaction.

“Quality is no longer just about silicon,” said Vincent Tong, senior vice president of worldwidequality and new product introductions. “The whole Xilinx design experience and IP ecosystem arenow being transformed into the types of programs we’ve run for decades to drive relentless qualityimprovements and superb customer experiences.”

In keeping with this thrust, Tong says this year’s report will focus on Xilinx’s Targeted DesignPlatforms, with an emphasis on how quality is important in each element of what Xilinx delivers toits customers. “We offer outstanding FPGAs, but our commitment to quality does not end with sil-icon,” said Tong. “We are committed to improving the overall customer experience. Our customershave done amazing things with our products and we are committed to helping them become evenmore successful in creating wonderful new innovations.”

I encourage you to give the 2009 quality report a read.

A

Mike SantariniPublisher

C O N T E N T S

VIEWPOINTS XCELLENCE BY DESIGN APPLICATION FEATURES 2020

2424

Letter From the Publisher At Xilinx, Quality Extends Beyond Silicon…4 Xcellence in Industrial

FPGAs Fuel a Revolution in Intelligent Drive Design…12

Xcellence in A&D Using FPGAs in Mission-Critical Systems…16

Xcellence in Wireless Comms Virtex-4 Forms Foundation for GSM Security…20

Xcellence in Networking Research Design Platform Ignites Real-World Network Advances…24

Xcellence in New ApplicationsA Flexible Operating System for Dynamic Applications…30

Cover StoryFPGAs Create a Greener Future for Industrial Motor Control 88

1616

F O U R T H Q U A R T E R 2 0 1 0 , I S S U E 7 3

Xperiment Resurrecting the Cray-1 in a Xilinx FPGA…36

Xperts Corner FPGA Partial ReconfigurationGoes Mainstream…42

Xplanation: FPGA 101 Multirate SignalProcessing for High-Speed Data Converters…50

Ask FAE-X How to Speed Sine Calculations for Your Processor…54

XTRA READING

THE XILINX XPERIENCE FEATURES

5050

Are You Xperienced? Take advantage of a simplified training experience…60

Xtra, Xtra The latest Xilinx tool updates and patches, as of September 2010…62

Xclamations! Share your wit and wisdom by supplying a caption for our cartoon, and win an SP601 Evaluation Kit…66

3636

5454

Xcell Journal recently received 2010 APEX Awards of Excellence in the

categories “Magazine & Journal Writing” and “Magazine & Journal Design and Layout.”

CHANCE TO

WIN A SPARTAN-6

FPGA KIT – PAGE 66

by Mike SantariniPublisher, Xcell JournalXilinx, [email protected]

Electric motors are everywhere—runningtoys and appliances in your home, coolingyour office and powering cars, trains and fac-tory assembly lines worldwide. They are evenfound in robots on Mars. Given this widerange of applications, it isn’t surprising thatthere are dozens of types of electric motors,each with advantages and disadvantages.Function, size, power consumption/efficien-cy, performance, reliability, product life and,of course, cost are all attributes companiestake into account when designing new elec-tric motors and deciding which ones toincorporate in their products.

Today, however, the electric motor is atthe dawning of a new era of maximum effi-ciency and automation built upon highlysophisticated FPGA-powered motor con-trol systems. These new systems are juststarting to make their way into the facto-ries that create the world’s products andconsume a huge percentage of the world’spower. In the first quarter of 2011, Xilinxwill deliver a Targeted Design Platformthat will allow companies to quickly devel-op the most advanced motor control sys-tems to date and help factories reach newlevels of automation and efficiency.

8 Xcell Journal Fourth Quarter 2010

Xilinx FPGAs: Creating a Greener Future for Industrial Motor Control

FPGAs are advancing the sophistication of industrialmotor control products. An upcoming TargetedDesign Platform aims to push it further.

FPGAs are advancing the sophistication of industrialmotor control products. An upcoming TargetedDesign Platform aims to push it further.

COVER STORY

The Growing Complexity of Factory Motor ControlJoe Mallett, senior product line manager inXilinx’s industrial, scientific and medicalgroup, breaks industrial motor control intofive elements: communications, control,drive, feedback and diagnostics.

At a minimum, said Mallett, all industri-al motor control systems employ some typeof communications to the outside world toallow a motor to work with other motors viamaster controllers or enterprise software.“Traditionally, the communications part ofmotor control for the factory has been basedon serial communications, but it is rapidlymoving to Ethernet due to the real-timerequirements of communications, especiallyin safety applications, and to more easilylink with the enterprise, which has beenEthernet based for years,” said Mallett.

Then comes an element called “control,”which is the term the industry generally usesto describe the algorithm at the heart of themotor control system. This algorithm con-trols another element called drive—essen-tially, the silicon power devices that switchthe current on and off to drive the motor.The control and drive work together inwhat’s called “the control and drive loop” torun motors at optimum performance andefficiency given their application.

The next element in a motor control sys-tem is feedback. “These are sensors thatmonitor various aspects of a motor in realtime, including position, speed and torque,among others,” said Mallett. “To ensure reli-ability, many companies are trying to mini-mize the number of sensors they use in theirmotor control systems. As such, many aremoving to sensorless control, which requireshigh-performance, advanced algorithms toretrieve the feedback information.”

The final piece of the system puzzle isdiagnostics. “Being able to predict when amotor is likely to fail is becoming a com-mon requirement,” said Mallett. Here

medical markets at Xilinx, pointed out thatit’s quite complex to create a motor controlunit that makes even one motor more effi-cient because for maximum efficiency, thecontrol algorithms must be tailored to agiven motor’s targeted task in the factory.The complexity of the algorithm growsexponentially if it needs to control multiplemotors and even more so if the motor needsto incorporate feedback or diagnostics.

Corradi explained that even the mostbasic motor control systems employ com-plex algorithms to regulate the amount ofpower going into a motor or series ofmotors. Incorporating feedback and diag-nostics adds another level of complexity,but delivers systems that will instantly warnoperators if there is a potential problemwith a motor, and perhaps identify whatthe problem is and gauge its severity. Themost advanced systems even estimate howlong the motor will last given the problem.

“All motors will eventually wear out, butit’s extremely valuable if you can estimate ifand when it will happen,” said Corradi. “Ifa motor in an assembly line fails unexpect-edly, it can bring an entire factory to astandstill. Creating algorithms to controlthe efficiency of a motor is very complex,but adding these diagnostic capabilities ontop of efficiency controls requires a far moresophisticated algorithm, and a sophisticateddevice to execute that algorithm.”

The most advanced control systemstoday include active, real-time safety featuressuch as sensors that spot cracks in machineryand stop the machines in milliseconds toavoid catastrophic failure or conditions thatcould injure factory workers. Mallett notesthat in many cases these advanced safety fea-tures are also becoming standardized andmandated by government. The IEC 61508standards effort, for example, stipulates thatcontrol systems include functionality todetect potentially dangerous conditions andprevent hazardous events. In Europe, the

too, “prediction and diagnostics requirefast decisions and sophisticated signal-processing capabilities.”

In recent years, most of the R&D andwork in the motor control segment—whether it be for factory equipment, auto-mobiles or even toys—has focused onimproving motor efficiency.

Factories depend heavily on electricmotors to power robots, tools, assembly linesand HVAC (heating, cooling and ventila-tion) systems. Many run all this machinery24 hours a day, seven days a week, to fillorders and keep revenue coming in. As such,reliability and efficiency are paramount, as anunexpected failure in an electric motor canbring an entire assembly line or factory to ahalt, and may even harm workers.

Advanced Motor Control Combines Efficiency, Reliability and SafetyFurther, with 24/7 operation, energy con-sumption has a drastic impact on the totalcost of ownership and bottom line. In thefactory space, efficiency is increasinglymandatory, said Mallett.

“It’s always great to drive down energybills, but the fact is, many factories have nochoice in the matter,” said Mallett. “Theyhave to comply with government mandatesto reduce the pollution they create and theenergy they consume. However, most facto-ries can’t afford to shut down for a long peri-od of time and replace their entiremanufacturing lines with newer, more effi-cient ones. What they would prefer to do isrun their existing machinery more efficient-ly and swap in newer, more efficient motorsover time, as needed—keeping factorydowntime to a minimum.” To that end,equipment manufacturers are racing todevelop the most efficient motor and con-trol combination possible to help factoriesmeet their energy-savings goals, Mallett said.

However, Giulio Corradi, ISM systemsarchitect for the industrial, scientific and

Fourth Quarter 2010 Xcell Journal 9

‘Creat ing a lgor i thms to contro l the e f f ic iency o f a motor is very complex, and adding these d iagnost ic capabi l i t ies on top o f e f f ic iency contro ls requires a far more sophis t ica ted a lgor i thm.’

COVER STORY

Machinery Directive 2006/42/EC went intoeffect in December 2009, mandating thatmachinery manufacturers provide productsthat comply with IEC 61508.

But motor control in the factory is notrestricted to monitoring efficiency, safetyand motor longevity, Corradi explained.Companies also leverage motor controls tomake robots perform ever-more-sophisticat-ed and exacting functions. “A single robothas several motors and they must be able toperform several functions,” said Corradi.“They must be programmable to changetasks and they must work in concert withother robots. The algorithms to run themotors in a robot are extremely advanced.”

Even an HVAC system motor control isvitally important. Many factories use chemi-cals, solvents and contaminants and mustventilate them properly. Clean rooms mustalso minimize contaminants. Thesemachines must operate correctly all the time.

Mallett points out that in modern facto-ries all these systems—assembly lines, robots

and HVAC—are interrelated and intercon-nected. They must be orchestrated in concertwith one another. In many cases, newer sys-tems use high-speed wired interconnect stan-dards such as Industrial Ethernet, while someare starting to even use wireless communica-tions. Since factories tend to be noisy envi-ronments, the latter trend adds further layersof complexity to motor control designs.

FPGAs Displacing Off-the-Shelf Processors in Advanced Motor ControlBecause of all this complexity on the factoryfloor, the demand for processing perform-ance is growing exponentially and program-mable logic control is increasingly displacingtraditional processor-based motor control.

Historically, engineers have used micro-controllers for factory-class motor control,said Corradi. But that design approach ishitting a wall. “The new factory systems areso sophisticated now and integrate so manyfunctions that you would have to use sever-al microcontrollers, and often the biggest

and most expensive ones,” he said. “Then,after you’ve done that, you are fairly restrict-ed in what functionality you can add to yourmotor control. With an FPGA, you can addall the advanced features required to a singlechip and even add new functions after you’vedeployed the system in a factory.”

A prime example of this elevated com-plexity is a new trend called “networkedcontrol systems,” in which motor controlfeedback is routed via networking toenterprise software systems, not just to alocal operator’s station. In this application,“not only are the control algorithms morecomplex, but also, the network is essen-tially ‘in-the-loop,’ demanding real-timeperformance far beyond what standardoff-the-shelf processors can handle,”Corradi said. “By contrast, FPGAs canperform networking and control supportsimultaneously” (see Figure 2).

“A single FPGA can really do the workof dozens of microcontrollers,” saidCorradi. “It’s much easier to have this com-


Motor

Enterprise toFactory Bridge/Gateway

Industrial Networkingand Field Bus

Figure 1 – Xilinx FPGAs play an increasing role in advanced motor control at multiple points in the factory and even the connection between the factory and the enterprise.

COVER STORY


plexity in one device rather than distributedacross a board with many processors.”

What’s more, FPGAs excel at parallel aswell as serial operations. This means design-ers can program a single FPGA to coordi-nate multiple motor controller functions fora variety of motors simultaneously. The real-time requirements don’t stop there, but alsoinclude functional safety features, with theconcomitant need to shut down a line inmilliseconds if attached sensors detectunsafe conditions. “Functional safety isenabled within FPGAs by using silicon-spe-cific functions, dedicated hardware IP and asoftware stack that, combined, make thecomplete system,” said Corradi.

Corradi noted that customers can leveragean FPGA’s reliability and the design tech-niques Xilinx has refined over 20 years foraerospace applications to implement safetyfeatures in their motor control systems.“Companies that add safety features to theirmotor control units typically have to designspecific chips dedicated to safety,” saidCorradi. “In an FPGA, you can partition asegment of the chip just for this purpose andhave a clear separation between the safety andnonsafety segments of the chip. This makes iteasier if they need to create derivative systemsor upgrade systems, because they can simplyleave the safety part of the chip alone.”

Another option would be to add dupli-cate or triplicate functions in the design toensure space-quality safety controls inthese systems.

In fact, the companies that specialize inthe safety aspect of motor control havebeen the early adopters of FPGA technolo-gy. “They see the advantage in how theFPGA could clearly bring a reduction indevelopment times and additional reliabili-ty, and a clear separation of safety and non-safety functions,” Corradi said.

Xilinx Readies Industrial Ethernet Motor Control Targeted Design PlatformIt is for these reasons that companies areturning to FPGAs over processor-basedsystems for industrial motor control.Mallett predicts that the pace of adoptionwill advance rapidly over the next fewyears, especially once engineers get theirhands on the upcoming Xilinx TargetedDesign Platform tailored for factoryautomation and motor control.

This market-specific platform will pullall the pieces together for the end customerdeveloping advanced motor controllers.The development environment willinclude the hardware, tools and IP neededto integrate Industrial Ethernet and motorcontrol into a single platform. Corradi said

the Targeted Design Platform will alsoinclude a targeted reference design to helpcustomers shorten their development cyclesand differentiate their products.

Xilinx will base the first-generation sys-tem on a Spartan®-6 FPGA running aMicroBlaze® processor. Looking forward,one can envision a platform leveragingXilinx’s ARM® MPU-based ExtensibleProcessing Platform, further acceleratingthe role of programmable logic in motorcontrol design—potentially saving factoriesmillions of dollars and ensuring the safetyof industrial workers worldwide.

A key part of the factory automationTargeted Design Platform is a library ofgeneral-purpose and market-specific IPthat Xilinx and its alliance partners willoffer to help customers develop Spartan-6-based integrated motor controllers quickly(see related story, page 12).

Mallett predicts that as companiesbecome more familiar with programmingFPGAs for motor control systems that endup in factories, they will start to useFPGA-based motor control across otherproduct lines and for other control appli-cations. “Companies are always seekingways to build a better motor or at least abetter way to control them, and FPGAs areideal for the job,” said Mallett.

CommunicationMonitoring/Setup

Motor Control and Drive

Feedback

EthPort

EthernetPHY

CrystalOscillator

JTAG

PROM

Encoder

Encoder

ProgramMemory 16K

DataMemory 2K

InterruptController

IndustrialEthernet

UART

ClockMgmt

Position andAngle

MicroBlazeControl

Processor

Watchdog

Timer/C

ounterC

lock

Bridge (2)

ADC

ADC

ADC

ADC

ADC

FO

C

v

PLB

I/F

ADCController

Stator Current(1)

Stator Current(2)

DC Bus Voltage

PW

M

Bridge (1)

FO

C

v

PW

MS

inc3

Sin

c3S

inc3

Sin

c3S

inc3

Spartan-6FPGA

Figure 2 – Engineers can implement many motor control features on a single FPGA.

COVER STORY

by Kasper Feurer Embedded Systems Engineer Xilinx [email protected]

Richard TobinEmbedded Systems EngineerXilinx [email protected]

Manufacturers of intelligent drives, as wellas many other players in the automotiveand ISM sectors, are facing numerouschallenges to satisfy new market demandsand meet constantly evolving standards.In modern industrial and automotiveapplications, motors must provide maxi-mum efficiency, low acoustic noise, a widespeed range and reliability, all at anacceptable cost. On today’s factory floor,motor-driven equipment consumes two-thirds of total electricity energy, leading tothe urgent need to develop more energy-efficient systems. Interoperability is also acritical design requirement, as in manycases a drive will serve as a component ina large-scale process. A key factor thatinfluences this requirement is the breadthof industrial networking protocols (thefield buses) and associated device profiles,which serve to standardize the representa-tion of a drive within a network. The fieldbuses themselves (for example, CAN andProfibus) are diverse, and despite sharingthe same generic name they are not readi-ly interchangeable. In an attempt toreduce cost and improve communicationsbetween industrial controllers, field bus

providers have developed Ethernet-basedindustrial networking solutions, and sev-eral new protocols, such as EtherCAT,Profinet and EtherNet I/P among others,have emerged in recent years. However,these are also divergent technologies anddrive manufacturers struggle to supportall the major players.

Xilinx Design Services (XDS) hasaddressed all of these issues in developingan FPGA-based prototype motor controlplatform supporting the CANopen andEtherCAT interfaces for a key player inthe ISM space. Our role was to designand implement a fully functional andmodular system fit for reuse in the cus-tomer’s next-generation intelligent drives.The Xilinx® Spartan®-6 FPGA SP605Evaluation Kit Base Targeted DesignPlatform, along with third-party IP, pro-

vided state-of-the art motor control algo-rithms and industrial networking supportin a modular system architecture, yieldingan efficient and scalable design.

Choosing the FPGA PathThe customer’s existing, microcontroller-based solutions did not deliver what thecustomer wanted most: a scalable platform.A Spartan-6 FPGA-based intelligent drivecontrol system provides all the necessaryscalability, logic and compute power in asingle chip, reducing costs while also avoid-ing obsolescence. Such a platform can beupgraded for years to come with the lateststandards of industrial networking and themost efficient motor control algorithms. Inaddition, the reprogrammable nature ofFPGAs facilitates customization of a singlebase motor control system to meet specific


Revolutionizing Intelligent DriveDesign with the Spartan-6Xilinx Design Services uses the SP605 Evaluation Kitand state-of-the-art motor control IP to prototype aninterface-independent intelligent drive control system.

XCEL LENCE IN INDUSTR IAL

customer requirements, allowing easy inte-gration into the existing industrial net-work. In short, the Spartan-6 FPGA fulfillsall the challenging requirements of theindustrial space.

For newcomers to the world of FPGA-based system design, like our customer,Xilinx’s Targeted Design Platforms offerthe ideal starting point by providing arobust set of well-integrated and tested ele-ments right out of the box. You can auto-mate an even greater portion of the finaldesign by adding domain-specific and mar-ket-specific platform offerings to the Baseplatform. These targeted reference designsreduce the learning curve by demonstratingFPGA implementations of real-world con-cepts, allowing customers to put their mus-cle into the design and development of thedifferentiating features of the final product.

Our solution combined the Spartan-6SP605 Evaluation Kit with third-partyofferings—namely the NetMot FMCboard and IP from QDeSys plus industrialnetworking IP from Bosch and Beckhoff.Not only were all the basic building blocksof the desired system in place from thestart, but we could proceed with prototypedevelopment without the need for a cus-tom FPGA board, thus allowing the cus-tomer to verify the viability of this newplatform at minimum cost. To furtherenhance the time-to-market and reduce therisks involved with a first-time FPGA sys-tem design, the customer asked us to notonly deliver this prototype but also to sup-port the adoption of FPGAs in its next-generation intelligent drives.

Ultimately, both engineers and theirmanagers benefited by this approach. Theformer learned FPGA-based design faster,armed with best practices gleaned fromXDS, while the management slashed thetime it took to deliver product along withthe business risks.

Intelligent Drive Control System PrototypeThe XDS portfolio covers the entireFPGA design development cycle, fromspecification creation through coding,verification, timing closure and systemintegration. Drawing on years of experi-ence in embedded-processor system and

Depending on the combination of PLCand the type of intelligent drive (CAN orEtherCAT), the industrial network is eithera serial bus or a standard 100-MbitEthernet interface. For both solutions theprototype uses a direct line link betweenthe PLC and the motor—either a two-wireserial interface for CAN, or a standardRJ45 100Base-TX Ethernet connection forEtherCAT.

The motor control printed-circuitboard, typically one of a number of PCBsin an intelligent drive, is dedicated to con-trolling the motor via commands from thePLC. This motor control board is wherethe flexibility of an FPGA comes intoplay. Rather than a single interface andsingle motor control algorithm solution,as in the conventional ASIC/microproces-sor approach, the Spartan-6 FPGA can bereprogrammed with dedicated networkingand motor control IP blocks, as well as thecontrol software to suit the customer’s spe-cific needs. In this manner, a single FPGA-based PCB can fulfill the functions ofmany ASIC-based boards. At the sametime, it future-proofs the intelligent driveby providing a mechanism to update the IPto the latest standards.

Rather than designing this motor con-trol board from scratch, XDS exploited the

software application design, along withthe ability to integrate third-party IP,good project management practices and afully certified ISO9001 developmentprocess, XDS delivered the intelligentdrive control system prototype very earlyin the customer’s product developmentcycle. The resulting custom TargetedDesign Platform enabled the customer’sengineers to familiarize themselves withFPGA design processes and hence opti-mize the power of this new technology intheir next-generation products.

Let’s look more closely at the main com-ponents of this intelligent drive control sys-tem prototype as illustrated in Figure 1.

The programmable logic controller(PLC) operates the intelligent drive, whichis attached to the industrial network in realtime. For the purpose of this prototype, weused two PC-based PLCs to handle thetwo industrial networking standards thesystem supports: miControl mPLC for thecontroller-area network (CAN) andTwinCAT for the EtherCAT industrialEthernet field bus system. The PLC gener-ates predefined command messages (forexample, start and stop) and verifies thecorrect behavior of the motor by analyzingthe responses received (current speed, tem-perature, voltage and so on).


Industrial Network

Intelligent Drive FPGA-basedMotor Control PCB

IndustrialNetwork IP

MotorControl IP

MicroBlazewith

ControlSoftware

Spartan-6 FPGA

Figure 1 – The drive control system prototype


Targeted Design Platform concept, inte-grating all the customer’s desired elementsby combining the Xilinx Spartan-6 SP605Evaluation Kit, the NetMot FMC board,and industrial networking and motor con-trol IP, thus delivering this proof-of-con-cept prototype before the customer hadcompleted its new PCB. Figure 2 showshow we combined the various compo-nents to yield the prototype developmentplatform. As a result, the customer greatlyreduced the integration effort and wasable to explore the best design optionswithout re-engineering the final design.

The SP605 Base Targeted DesignPlatform is a general-purpose FPGA plat-form that hosts a Spartan-6 LX45T in aproven implementation, alongside manycommonly needed peripherals such asDDR3 RAM, flash memories for pro-gram/bitstream storage, UART for debugand JTAG for FPGA programming.Another key element of the SP605 as wellas all recent Xilinx boards is the FPGAMezzanine Card (FMC) connector, whichenables designers to expand the base boardwith custom functions and interfaces.

This feature of the SP605 enabled usto build on this base by using features pro-vided by the QDeSys NetMot FMC(www.qdesys.com), which implements the

power electronics required for motor con-trol, such as voltage inverters and ADCsfor obtaining sensor data. You can connectthe motor directly to these inputs/outputsas shown in Figure 2. The NetMot FMCalso expands the industrial network con-nectivity features of the SP605 by addingtwo CAN and two Ethernet physical-layerinterfaces. They are accessible to the FPGAvia the FMC connector and a PLC overstandard interfaces.

The test PC functions as both a host forthe PLC software and an FPGA program-ming/debug platform using the UART andJTAG interfaces. In addition, we used thistest PC to develop the MicroBlaze™embedded-processor system for theSP605’s LX45T FPGA with Xilinx’s ISE®

12.1 Design Suite. This embedded systemis responsible for processing the commandsreceived from the PLC and controlling themotor accordingly.

The MicroBlaze software application,networking and motor control IP blocksseen in Figure 2 represent the modules ofthe design that change depending on theinterface (EtherCAT or CANopen) andmotor type chosen. One of the main chal-lenges for XDS was to ensure that theprocess of switching between these optionsbe as simple as possible, thus ensuring that

the customer would be able to reuse thesame methods in the future for furtherindustrial network types like Profinet andfor new motors.

Implementation Details Let’s examine these parts of the Spartan-6embedded system in greater detail. As shownin Figure 3, the motor control IP block thatwe used—the Xilinx Motor Control Library(XMCLIB)—is identical for both versions ofthe design. This custom IP core, which per-mits the FPGA to control the NetMotFMC’s motor power electronics, plugs direct-ly into Xilinx’s Embedded Development Kit(EDK). This allowed us to add the IP to ourembedded design from the Xilinx PlatformStudio (XPS) project and configure it to suitthe motor that was connected to the FPGAvia the FMC connector. The XMCLIB soft-ware driver is a set of low-level functions giv-ing the motor control application access tothe XMCLIB register interface.

The networking IP, on the other hand, iswhere the system differs for the two variants.For the CAN version of the design we chosethe standard LogiCORE™ IP XPSController Area Network, delivered with theISE 12.1 design suite and licensed by BoschGmbH. For the EtherCAT version, we usedBeckhoff ’s EtherCAT Slave Controller IP


Test PC

PLCTwinCAT / mPLC

JTAGProgramming I/F

NetMot FMC

Ethernet /CAN PHY

I/Fs

MotorPower

Electronics

FMC Connector Flash DDR3

MicroBlazeSoftware Application

Motor

StandardPeripherals

(MPMC,UARTLite

MDM, etc.)

NetworkingIP

MotorControl IP

MicroB

laze

UART DebugConsole

SP

605

LX45T

FP

GA

UART

JTAG

Figure 2 – Prototype Spartan-6 FPGA-based motor control board



Core (www.beckhoff.com) for Xilinx FPGAs.Both of these IP cores are also available fromthe XPS tools’ IP Catalog tab, making inte-gration and configuration in the design verystraightforward. In this case, instead of usinga simple driver to provide access to the net-working IP, we used Port’s (www.port.de)CANopen and EtherCAT Stack solutions,which offer fully featured protocol imple-mentations right out of the box.

Finally, we designed a custom embed-ded-software application to run onMicrium’s (www.micrium.com) µC/OS-IIon the MicroBlaze processor system. Thisembedded operating system enhances theprototype system’s real-time capabilitiesand provides multitasking, message queuesand semaphores, among other features.

We also recognized that it was importantto structure the application in a way thatwould allow us to retarget it to differentnetwork interfaces. To achieve this, wedesigned an interface abstraction layer thatlets us encapsulate the communications andmotor control elements of the software.

On one side of this interface (Figure 4), weimplemented a networking module (Port’sCANopen or EtherCAT) to manage the com-munications for the networking IP availablein the system. These modules plug in seam-lessly to our interface abstraction layer. At thetop level of these stacks, we pass communica-tions and control data (such as PDOs, SDOsand NMT state transitions) into the abstrac-tion layer, which interprets the data and pres-ents it to the motor control application ascommands such as start/stop or rotate at aspecified velocity or to a specified position.

To determine a common set of messagesand commands for the interface abstraction,we researched existing publications in thearea of industrial networking and encoun-tered the IEC 61800-7 standard. For theexisting field bus technologies, severalschemes are used to standardize the commu-nications with a drive device (such as CiA-402 for CANopen or PROFIdrive forProfinet). IEC 61800-7 presents a commonrepresentation of a drive and proceeds to pro-vide a set of mappings between this represen-tation and the existing drive profiles.

The concepts presented in this standardallowed us to develop our interface abstrac-tion, which in turn allowed us to encapsu-late the networking component of thesystem. We can therefore change the net-working interface present in the system, andwe only need to tailor a small portion of thesoftware to make it compatible with theexisting motor control application.

Going ForwardThe successful delivery of the intelligent drivecontrol system prototype has clearly illustrat-ed the potential power of FPGAs in industri-al Ethernet networking, field buses and motorcontrol. Although some work remains todevelop a fully featured product, XDS tai-lored a Targeted Design Platform andenhanced it for the customer, creating a cus-tom solution that will greatly reduce the risksand effort required to come to a final, engi-neered product. As a next step, XDS is inves-tigating expanding this Targeted DesignPlatform to support the Profinet IP core andstack, demonstrating that the modularapproach and the design practices adopted arevery effective for the customer.

Spartan-6 LX45T

MicroBlaze

Micrium µC/OS-II

Motor Control Application

InterfaceAbstraction

Layer

PortEtherCAT

ORCANopen

EtherCAT IPOR

CAN IP

NetMot FMCEthernet Ports

NetMot FMCPower Electronics

StandardXilinx SW

Drivers

XMCLIBSW Driver

Standard XilinxIP

(UART, MDM, MPMC, etc.)

XMCLIB IP

MotorControl

Application

CANopen

Message passing

Interface

EtherCAT

Figure 3 – CAN/EtherCAT embedded system

Figure 4 – Interface abstraction layer


by Adam Peter TaylorPrincipal EngineerEADS [email protected]

Dramatic surges in FPGA technology,device size and capabilities have over thelast few years increased the number ofpotential applications that FPGAs canimplement. Increasingly, these applica-tions are in areas that demand high relia-bility, such as aerospace, automotive ormedical. Such applications must func-tion within a harsh operating environ-ment, which can also affect the systemperformance. This demand for high reli-ability coupled with use in rugged envi-ronments often means you as theengineer must take additional care in thedesign and implementation of the statemachines (as well as all accompanyinglogic) inside your FPGA to ensure theycan function within the requirements.

One of the major causes of errors with-in state machines is single-event upsetscaused by either a high-energy neutron oran alpha particle striking sensitive sectionsof the device silicon. SEUs can cause a bitto flip its state (0 -> 1 or 1 -> 0), resultingin an error in device functionality thatcould potentially lead to the loss of the sys-tem or even endanger life if incorrectlyhandled. Because these SEUs do not resultin any permanent damage to the deviceitself, they are called soft errors.


Using FPGAs in Mission-Critical SystemsUsing FPGAs in Mission-Critical SystemsSEU-resistant state machines hold the keyto adapting programmable logic devicesfor high-reliability applications.

XCEL LENCE IN AEROSPACE & DEFENSE

The backbone of most FPGA design isthe finite state machine, a design methodol-ogy that engineers use to implement con-trol, data flow and algorithmic functions.When implementing state machines withinFPGAs, designers will choose one of twostyles, binary or “one hot,” although inmany cases most engineers allow the synthe-sis tool to determine the final encodingscheme. Each implementation scheme pres-ents its own challenges when designing reli-able state machines for mission-critical

systems. Indeed, even a simple state machinecan encounter several problems (Figure 1).You must pay close attention to the encod-ing scheme and in many cases take the deci-sion about the final implementationencoding away from the synthesis tool.

Detection SchemesLet’s first look at binary implementations(sequential or “gray” encoding), whichoften have leftover, unused states that the

startup following reset release. Typically,these states also keep the outputs in a safestate; should they be accidentally entered,the machine will cycle around to its idlestate again.

One-hot state machines have one flip-flop for each state, but only the currentstate is set high at any one time.Corruption of the machine by havingmore than one flip-flop set high canresult in unexpected outcomes. You canprotect a one-hot machine from errors by

monitoring the parity of the state regis-ters. Should you detect a parity error, youcan reset the machine to its idle state or toanother predetermined state.

With both of these methods, the statemachine’s outputs go to safe states and thestate machine restarts from its idle position.State machines that use these methods canbe said to be “SEU detecting,” as they arecapable of detecting and recovering froman SEU, although the state machines’ oper-

state machine does not enter when it isfunctioning normally. Designers mustaddress these unused states to ensure thatthe state machine will gracefully recover inthe event that it should accidentally enteran illegal state. There are two main meth-ods of achieving this recovery. The first is todeclare all 2N number of states when defin-ing the state machine signal and cover theunused states with the “others clause” at theend of the case statement. The othersclause will typically set the outputs to a safe

state and send the state machine back to itsidle state or another state, as identified bythe design engineer. This approach willrequire the use of synthesis constraints toprevent the synthesis tool from optimizingthese unused states from the design, asthere are no valid entry points. This typi-cally means synthesis constraints within thebody of the RTL code (“syn_keep”).

The second method of handling theunused states is to cycle through them at



ReadWrite

Incorre

ct sta

te transiti

on Incorrect state transition

State change to

unmapped state

State change to unmapped state

No Recovery Back to Idle State

Idle

State Encoding 00

WriteState Encoding 10

ReadState Encoding 01

Unmapped StateState Encoding 11

Figure 1 – Even a simple state machine can encounter several types of errors.

ation will be interrupted. You must takecare during synthesis to ensure that registerreplication does not result in registers witha high fanout being reproduced and henceleft unaddressed by the detection scheme.Take care also to ensure that the error doesnot affect other state machines that thismachine interacts with.

Many synthesis tools offer the option ofimplementing a “safe state machine”option. This option often includes more

logic to detect the state machine entering anillegal state and send it back to a legal one—normally the reset state. For a high-reliabilityapplication, design engineers can detect andverify these illegal state entries more easilyby implementing any of the previouslydescribed methods. Using these approaches,the designers must also take into accountwhat would happen should the detectionlogic suffer from an SEU. What effectwould this have upon the reliability of the

design? Figure 2 is a flow chart that attemptsto map out the decision process for creatingreliable state machines.

Correction SchemesThe techniques presented so far detect orprevent an incorrect change from one legalstate to another legal state. Dependingupon the end application, this could resultin anything from a momentary systemerror to the complete loss of the mission.


State MachineRequired

ProtectionRequired

Implement StateMachine as

Normal

ProtectionRequired

CorrectionRequired

Detection orCorrection

Implement StateMachine UsingTriple-ModularRedundancy

Implement StateMachine UsingState Encodingwith a Hamming

Distance of Three

Sequential or One-Hot

Implementation

DetectionRequired

Sequential

One-Hot

Implement StateMachine to CycleThrough Unused

States

Implement StateMachine to

Preserve UnusedStates

Implement StateMachine UsingState Encodingwith a HammingDistance of Two

Implement StateMachine With

Parity Protection


Figure 2 – This flow chart maps out the decision process for creating reliable state machines.



Techniques for detecting and fixingincorrect legal transitions are triple-modu-lar redundancy and Hamming encoding.The latter provides a Hamming distance ofthree and covers all possible adjacent states.A simpler technique for preventing legaltransitions is the use of Hamming encod-ing with a Hamming distance of two(rather than three) between the states, andnot covering the adjacent states. This, how-ever, will increase the number of registersyour design will require.

Triple-modular redundancy, for its part,involves implementing three instantiationsof the state machine with majority votingupon the outputs and the next state. This isthe simpler of the two approaches andmany engineers have used it in a numberof applications over the years. Typically, aTMR implementation will require spatialseparation of the logic within the FPGA toensure that an SEU does not corrupt morethan one of the three instantiations. It isalso important to remove any registersfrom the voter logic, since they can create asingle-point failure in which an SEU couldaffect all three machines.

The use of a state machine with encod-ing that provides a Hamming distance ofthree between states will ensure both SEUdetection and correction. This guaranteesthat more than a single bit will changebetween states, meaning an SEU cannotmake the state transition from one legalstate to another erroneously. The use of aHamming distance of two between states issimilar to the implementation for thesequential machine, where the unused statesare addressed by the “when others” clause orreset cycling. However, as the states areexplicitly declared to be separate from eachother by a Hamming distance of two with-in the RTL, the state machine cannot movefrom one legal state to another erroneouslyand will instead go to its idle state shouldthe machine enter an illegal state. This pro-vides a more robust implementation thanthe binary one mentioned above.

If you wish to implement a state machinethat will continue to function correctlyshould an SEU corrupt the current state reg-ister, you can do so by using a Hamming-code implementation with a Hamming

distance of three and ensuring its adjacentstates are also addressed within the statemachine. Adjacent states are those which areone bit different from the state register andhence achievable should an SEU occur. Thisuse of states adjacent to the valid state tocorrect for the error will result in N*(M+1)states, where N is the number of states andM is the number of bits within the state reg-ister. It’s possible to make a small high-relia-bility state machine using this technique,but crafting a large one can be so complicat-ed as to be prohibitive. The extra logic foot-print associated with this approach couldalso result in lower timing performance.

Deadlock and Other IssuesThere are other issues to consider whendesigning a high-reliability state machinebeyond the state encoding schemes.Deadlock can occur when the state machineenters a state from which it is never able toleave. An example would be one statemachine awaiting an input from a secondstate machine that has entered an illegal stateand hence been reset to idle without gener-ating the needed signal. To avoid deadlock,it is therefore good practice to provide time-out counters on critical state machines.Wherever possible, these counters shouldnot be included inside the state machine butplaced within a separate process that outputsa pulse when it reaches its terminal count.Be sure to write these counters in such a wayas to make them reliable.

When checking that a counter hasreached its terminal count, it is preferableto use the greater-than-or-equal-to opera-tor, as opposed to just the equal-to oper-ator. This is to prevent an SEU fromoccurring near the terminal count andhence no output being generated. Youshould declare integer counters to apower of two and, if you are usingVHDL, they should also be modulo thepower of two to ensure in simulation thatthey will wrap around as they will in thefinal implementation [ count <= (count +1) Mod 16; for a 0- to 15-bit integercounter]. Unsigned counters do notrequire this, since there is no simulationmismatch between RTL and post-routesimulation regarding wraparound.

You can replicate data paths within thedesign and compare outputs on a cycle-by-cycle basis to detect whether one ofthem has been subjected to an SEUevent. Wherever possible, edge-detect sig-nals to enable the design to cope withinputs that are stuck high or stuck low.You should analyze each module withinthe design at design time to determinehow it will react to stuck-high or stuck-low inputs. This will ensure it is possibleto detect these errors and that they can-not have an adverse effect upon the func-tion of the module.

Simulation and VerificationOnce you have designed the statemachine, you will of course wish to simu-late it to ensure it meets requirements. Itis also possible during simulation to try totest how the machine will behave whensubjected to an SEU (it is worth notingthat this can take considerable time andeffort to achieve in full). You can corruptthe state machine using the force optionwithin the ModelSim utilities library froma test bench. When attempting to simu-late the effects of SEUs, it is important toremember that they are asynchronousevents that can occur at any point in time.There are also third-party tools availablethat can test designs for SEU perform-ance. One free offering, which injectsSEUs into the design, is the ESA SingleEvent Upsets Simulation Tool, available athttp://www.nebrija.es/~jmaestro/esa/.

Designers can undertake a similarapproach upon the post-route simulation.However, it’s a better idea to verify thecorrect performance of the RTL code withregard to functionality and SEU toleranceand then conduct a formal equivalencecheck between the RTL and the post-route simulation netlist to ensure thatthey both behave in the same manner.

FPGAs will continue to be used in anincreasing number of mission-criticalapplications as the growth in capabilityand performance demanded of these sys-tems increases. The techniques present-ed here will provide you with themethodology required to make such adesign a success.

by Mansoor NaseerAssistant ProfessorBahria University, E-8, Islamabad, [email protected]

Mobile telephony, the Internet and stream-ing applications enable people to communi-cate over long distances and findinformation and entertainment online. Butwith the explosion in mobile communica-tions usage, the tremendous volume ofinformation transmitted in the air is vulner-able to security risks. In order to maintainsecurity, engineers use cryptographic tech-niques to disguise the data. For the GlobalSystem for Mobile communication (GSM)standard, their algorithm of choice is theA5/1. Using A5/1, the encryption anddecryption are performed over the network,both at the handset and at the base station.

This article will walk you through animplementation of the A5/1 stream cipher,taking note of the code snippets used toimplement different blocks of the cipher.We will also take a look at a couple of newand different twists on the original A5/1algorithm for enhanced security. In partic-ular, I am changing how the feedback func-tion behaves as well as reworking thecombiner output. For this work, I used aXilinx® Virtex®-4 LX series FPGA forhardware implementation and the XilinxISE® for synthesis. I used MentorGraphics’ ModelSim 6.0 for simulation.

Before plunging into implementationdetails, let us begin with a brief descriptionof the A5/1 algorithm.


Virtex-4 FPGA Forms Foundation for Secure GSM StandardsVirtex-4 FPGA Forms Foundation for Secure GSM StandardsHere are some design mechanisms and tips you can try when implementing current and future A5/x algorithms using Xilinx FPGAs.Here are some design mechanisms and tips you can try when implementing current and future A5/x algorithms using Xilinx FPGAs.

XCEL LENCE IN WIRELESS COMMS

Ins and Outs of A5/1 Algorithm In the GSM standard, information travelsover the airwaves as sequences of frames.Frame sequences contain a digitized versionof the user’s voice as well as streaming audioand video data used in mobile Internetbrowsing. The system transmits each framesequentially, 4.6 milliseconds apart. Aframe consists of 228 bits; the first 114 bitsare called “uplink data” and the next 114bits are known as “downlink data.”

We initialize A5/1 using a 64-bit secretkey together with a known frame numberthat is 22 bits wide. The next step is tofeed the input bits of the frame and secretkey into three linear feedback shift regis-ters (LFSRs) that are 19, 22 and 23 bits insize respectively. The feedback is providedafter XOR’ing of selected taps, whichhelps increase randomness in the finaloutput. I will call the three LFSRs R0, R1 and R2, such that each LFSR followsa primitive function. R0, R1 and R2 respectively use x19+x5+x2+x+1,x22+x+1 and x23+x15+x2+x+1 as theirprimitive functions.

other. The session key KC takes 64cycles in loading whereas Fn takesanother 22 cycles. In every cycle, theinput bit is XOR’ed with a linear feed-back value coming from selective taps.

3. Next, allow the LFSRs to clock for100 cycles, a process called mixing.Linear feedback is active and so arethe chip enables, allowing a shift/no-shift that is also active.

4. After the mixing step is complete, thenext 228 cycles result in the final out-put key stream. The uplink and down-link streams (each consisting of 114bits) are used for decryption andencryption operations. This stepemploys nonlinear feedback and amodified combiner as against the tra-ditional A5/1 algorithm defined above.

5. You can repeat the previous steps forevery new frame Fn. The frame num-bers are updated where the session keyremains unchanged until authentica-tion protocols provoke it.

Hardware ImplementationI did the hardware implementation of theproposed algorithm on Xilinx Virtex-4series FPGAs. Because we designed theimplementation logic for compactness, Itargeted the smallest device in the Virtex-4family, namely the XC4VLX15. Figure 1shows the overall architecture of the top-level module.

The top level comprises two buildingblocks and glue logic. I designed the gluelogic to provide chip connectivity, startlogic and data-loading features. There aretwo other building blocks, Majority/Taprule and LFSR bank, each with its ownunique functions.

The Majority/Tap rule block receivesthe selective tap positions from the LFSRbank and performs two separate functionsindependently. As the selection criteria forMajority or Tap rule are based on the lasttap of R0, R1 and R2, the logic forMajority rule and Tap rule is implementedin parallel. I also performed one other opti-mization in the implementation ofMajority rule. The pseudo algorithm forselecting majorities is described as:

The clocking sequences of LFSRs use aparticular mechanism; otherwise, it wouldbe very easy to decipher the output ofLFSRs. Therefore, clocking is based on amajority rule. This rule uses one tap eachfrom R0, R1 and R2, and determineswhether the taps are mainly 0 or 1. That is,two or more taps with a value of 0 imply amajority value 0; otherwise, it is 1. Oncethe majority is established, clock the LFSRsif their input bit matches the majority bit.If the input bit does not match the majori-ty value, no input bit will be inserted in theLFSR and hence, no shifting will takeplace. Additionally, we get the output byXOR’ing each output bit of R0, R1 and R2in the original A5/1. In this work, I havereplaced the standard XOR’ing with amore sophisticated output-generatingfunction that we call a “combiner.”

The key setup routine of the A5/1 algo-rithm is as follows:

1. Initialize R0, R1 and R2 with 0.

2. Load the value of session key KC andframe Fn into the LFSRs, one after the


Id23

Id22

Id19

R0

R1

R2

clk

frame

key

start

Load Logic

LFSR Bank

Com

bine

r

ce19

ce22

ce23

Majority / TapRule

ce_universal

Figure 1 – Architectural diagram of the top module


If (R0[13] + R1[12] + R2[16])

>= 2 then maj1 = 1

else then maj1 = 0

Similarly, we find the value of maj2based on tap numbers 7, 5 and 8. Based onthese majority functions, we define thechip enable for R0 as:

If (R0[13]== maj1) AND (R0[7] ==

maj2) then chip enable for Majority

rule is 1 else 0.

I implemented the pseudo code forselecting maj1 and maj2 in the form ofsimple and/or gates. Similarly, the pseudocode for chip enables maps onto a simpleXNOR function, combining all the logicand reducing area.

The technique implements the Tap rulealgorithm using case statements. The orig-inal A5/1 algorithm determines the shiftoperation or chip enable for LFSRs basedon the majority rule only. As I have intro-duced a modified Majority rule and anadditional Tap rule for shifting LFSRs, Ineeded another mechanism. Thus, makethe final selection of chip enables for R0,

R1 and R2 based on the PoliSel signal,which determines whether to output thechip enable signal established by Majorityor Tap rule. The PoliSel signal imple-ments this equation: PoliSel = R0[18]XOR R1[21] XOR R2[22].

The LFSR BankThe LFSR bank module houses multiplefunctions and implementations. Thismodule instantiates a core LFSR withvarying parameters for establishing R0,R1 and R2 LFSRs. The width of theseLFSRs is 19, 22 and 23 respectively. Thismodule also instantiates an improvedlinear feedback function from A5/1.Additionally, the LFSR bank imple-ments our novel nonlinear feedbacklogic. I also implemented in this blockthe combiner that produces the finaloutput. Figure 2 shows the block dia-gram for the LFSR bank. Here, theLFSRs are serial-in, parallel-out shiftregisters. Xilinx Virtex-4 devices allowconvenient mapping of these shift regis-ters onto hardware. Using the standard

template for LFSRs as shown below, wecan infer generic serial registers.

module lsr (clk, ce, si, po);

parameter width = 18;

input clk, ce, si;

output [width-1:0] po;

reg [width-1:0] shftreg = 0;

always @ (posedge clk)

begin

if (ce) shftreg <=

{shftreg[width-2:0], si};

end

assign po = shftreg;

endmodule

With the use of parameters, it becomeseasy to instantiate LFSRs at the higherlevel. By passing the parameters using the“defparam” feature of Verilog, we can inferthe required LFSRs of width 19, 22 and23 bits at the top level. The inference isconveniently done by using the followinglines of code:

lsr lfsr19bit (clk, ce19, si19,

lfsr19);

defparam lfsr19bit.width = 19;


ModifiedInput Logic

Linear Feedback R0

Linear Feedback R1

Linear Feedback R2

Combiner

NonlinearFeedbackController

SupportiveNonlinearFunctions

Figure 2 – Logic blocks at the input and output of R0, R1 and R2 in LFSR bank module



(The latter controls the parameter widthinside instance lsr.)

As seen from the following code snippet,the input loading path of the LFSRs is well-defined in terms of simple XOR operations.

wire frameXor19, keyXor19, si19input;

assign frameXor19 = frame ^ si19lfb;

assign keyXor19 = key ^ si19lfb;

assign si19input = (frameXor19 ^

keyXor19) ? frameXor19:keyXor19;

The existence of asynchronous gatesincreases the length of the critical path.However, here it is well within tolerablelimits, as the critical path in the currentcase consists of only three XORs and onemultiplexer.

After loading is completed, the feed-back path chooses between either linear ornonlinear feedback. The nonlinear feed-back controller makes this selection deci-sion. A choice of nonlinear feedbacksuddenly changes the design specificationsand parameters. For instance, the nonlin-ear feedback function is activated periodi-cally every 11 clock cycles and remainsactive for 10 clock cycles. To keep theimplementation area small, we have imple-mented counters rather than adders toenforce the periodic behavior of the non-linear feedback function. The imple-mentation requires separate counters for

R0, R1 and R2 for counting the numberof 1’s. The pseudo algorithm for thisimplementation is as follows:

1. Detect falling edge on ldlfsr signal.

2. Start counting for 100 cycles.

3. After 100 cycles, for every 10 cyclesactivate the stream breaker, then deactivate for 10 cycles until 114clocks elapse.

Stream breaker pseudo code:

4. Sum the contents of lfsr 19, 22, 23.

5. If count19 > 10, count22 > 11 orcount23 > 11

6. pass 1 into the LFSR

7. else

8. pass 0 into the LFSR.

Since the decision of inserting 1’s or 0’sin the input of R0, R1 and R2 needs to betaken every clock cycle, once the nonlinearfeedback is activated, the number of coun-ters and adders significantly increases inarea and critical path.

The combination of chip enable andinput causes the nonlinear feedback path toassume 0 or 1 based on the algorithm of thestream breaker. Figure 3 shows a hardwarebuilding block implementing this scheme.

Using counters with defined thresh-olds assists the implementation of the

nonlinear feedback controller. Do notallow the counter to decrement afterapproaching 0 or to increment afterreaching the high value corresponding toR0, R1 and R2, namely 19, 22 and 23respectively. The following code snippetprovides one example implementation ofthe counter:

reg [4:0] sum19=5'd0;

wire inc_sum19, dec_sum19;

assign inc_sum19 = si19 & ce19;

assign dec_sum19 = lfsr19[18] & ce19;

always @ (posedge clk)

begin

case ({inc_sum19, dec_sum19})

2'b01:

begin

if (sum19 == 5'd0) sum19<=5'd0;

else sum19<=sum19-1;

end

2'b10:

begin

if (sum19 == 5'd18) sum19<=5'd18;

else sum19<=sum19+1;

end

default: sum19 <= sum19;

endcase

end

This code also shows how the incre-ment and decrement functions aredependent on chip enable signals and notjust the value at the input of LFSR onevery rising edge of the clock.

It is extremely important to imple-ment tight and secure encryption algo-rithms for future mobile communicationstandards. As the bandwidth require-ments are rapidly evolving and streaming-data volume is on an exponential rise, it isall the more necessary to be able to imple-ment vary fast and parallel ciphers fordata security. A Xilinx Virtex-4 deviceoffers the perfect solution for parallelism,high speed and bandwidth requirementsin this application. With low-voltageoperations and ultralow power consump-tion, this platform can easily adapt to theupcoming secure mobile applications.With the small tricks and snippets ofcode presented here, you can easily mapthe A5/1 and its future variants ontoXilinx FPGAs.

sum 19

sum 23

sum 22

10

11

11

ce 19

ce 22

ce 23

fbsel 19

fbsel 22

fbsel 23

> =

> =

> =

Figure 3 – Nonlinear feedback controller and input selection values


by Michaela BlottSenior Research Engineer Xilinx, Inc. [email protected]

Jonathan EllithorpePhD CandidateStanford University

Nick McKeownProfessor of Electrical Engineering and Computer ScienceStanford University

Kees VissersDistinguished EngineerXilinx, Inc.

Hongyi ZengPhD CandidateStanford University

Stanford University, together with XilinxResearch Labs, is building a second-gener-ation high-speed networking design plat-form called the NetFPGA-10G specificallyfor the research community. The new plat-form, which is poised for completion thisyear, uses state-of-the art technology tohelp researchers quickly build fast, complexprototypes that will solve the next crop oftechnological problems in the networkingdomain. As with the first-generation plat-form, which universities worldwide haveeagerly adopted, we hope the new platformwill spawn an open-source community thatcontributes and shares intellectual proper-ty, thus accelerating innovation.


FPGA Research Design PlatformFuels Network AdvancesFPGA Research Design PlatformFuels Network AdvancesXilinx and Stanford University are teaming up to create an FPGA-based referenceboard and open-source IP repository to seed innovation in the networking space. Xilinx and Stanford University are teaming up to create an FPGA-based referenceboard and open-source IP repository to seed innovation in the networking space.

XCEL LENCE IN NETWORKING

This basic platform will provide every-thing necessary to get end users off theground faster, while the open-source com-munity will allow researchers to leverageeach other’s work. The combination effec-tively reduces the time spent on the actualimplementation of ideas to a minimum,allowing designers to focus their efforts onthe creative aspects.

In 2007, we designed the first-genera-tion board, dubbed NetFPGA-1G, arounda Xilinx® Virtex®-II Pro 50, primarily toteach engineering and computer sciencestudents about networking hardware.Many EE and CS graduates go on to devel-op networking products, and we wanted togive them hands-on experience buildinghardware that runs at line rate, uses anindustry-standard design flow and can beplaced into an operational network. Forthese purposes, the original board had to below in cost. And with generous donationsfrom several semiconductor suppliers, wewere able to bring the design in at an endprice of under $500. As a result, universi-ties were quick to adopt the board andtoday about 2,000 NetFPGA-1Gs are inuse at 150 schools worldwide.

But the NetFPGA very quickly becamemore than a teaching vehicle. Increasingly,the research community began to use ittoo, for experiments and prototypes. Forthis purpose, the NetFPGA team providedfree, open-source reference designs andmaintains a repository of about 50 con-tributed designs. We support new users,run online forums and offer tutorials, sum-mer camps and developers’ workshops.

Trend Toward ProgrammabilityFor more than a decade, networking tech-nology has trended toward more-program-mable forwarding paths in switches, routersand other products. This is largely because

designs in addition to FPGA silicon so asto speed up the design process.

To realize this vision, we will deliver aboard with matching FPGA infrastructuredesign in the form of both basic anddomain-specific IP building blocks toincrease ease of use and accelerate develop-ment time. We further will develop refer-ence designs, such as a network interfacecard and an IPv4 reference router, as well asbasic infrastructure that assists with build-ing, simulating and debugging designs. Theidea is to allow users to truly focus theirdevelopment time on their particular areaof expertise or interest without having toworry about low-level hardware details.

Unlike the mainstream Targeted DesignPlatforms, our networking platform targetsa different end-user group, namely, the larg-er research community, both academic andcommercial. Semiconductor partners areheavily subsidizing this project to assist inthe effort to keep the board cost to anabsolute minimum so as to encouragebroad uptake. Not only Xilinx but otherkey component manufacturers such asMicron, Cypress Semiconductor andNetLogic Microsystems are donating partsfor the academic end user. (Commercialresearchers will pay a higher price.)

Part of the project’s strength is the fact thatthis platform is accompanied by a communi-ty as well as an open-source hardware reposi-tory that facilitates the sharing of software, IPand experiences beyond the initial base pack-age. The result is an ever-growing IP librarythat we hope will eventually encompass awide range of reference components, net-working IP, software and sophisticated infra-structure thanks to contributions from manywell-known universities, research groups andcompanies. We hope that by providing acarefully designed framework, some light-weight coordination to share expertise and IP

networking hardware has become morecomplicated, thanks to the advent of moretunneling formats, quality-of-serviceschemes, firewall filters, encryption tech-niques and so on. Coupled with quicklychanging standards, those factors havemade it desirable to build in programma-bility, for example using NPUs or FPGAs.

Researchers often want to change someor all of the forwarding pipeline. Recently,there has been lots of interest in entirelynew forwarding models, such asOpenFlow. Researchers can try out newnetwork architectures at the national scaleat national testbeds like GENI in theUnited States (http://geni.net) and FIRE inthe EU (http://cordis.europa.eu/fp7/ict/fire/calls_en.html).

Increasingly, researchers are embracingthe NetFPGA board as a way to prototypenew ideas—new forwarding paradigms,scheduling and lookup algorithms, andnew deep-packet inspectors—in hardware.One of the most popular reference designsin the NetFPGA canon is a fully function-al open-source OpenFlow switch, allowingresearchers to play with variations on thestandard. Another popular reference designaccelerates the built-in routing functions ofthe host computer by mirroring the kernelrouting table in hardware and forwardingall packets at line rate.

NetFPGA, Take TwoFor the second-generation platform, theso-called NetFPGA-10G, we haveexpanded our original design goals to alsoinclude ease of use, with the aim of sup-plying end customers with a basic infra-structure that will simplify their designexperience. This goal is closely alignedwith the objective of Xilinx’s mainstreamTargeted Design Platforms, which pro-vide users with tools, IP and reference


An open-source hardware repository facilitates the sharing

of software, IP and design experiences, promoting technological

solutions to the next generation of networking problems.


in a systematic way, and a well-designedplug-and-play architecture with standardizedinterfaces, the open-source hardware reposi-tory will grow, promoting technologicalsolutions to the next generation of network-ing problems.

The NetFPGA-10G is a 40-Gbit/secondPCI Express® adapter card with a largeFPGA fabric that could support as many

applications as possible. As shown in Figure1, the board design revolves around a large-scale Xilinx FPGA, namely a Virtex-5XC5VTX240T-2 device [1].

The FPGA interfaces to five subsystems(see Figure 2). The first of these, the net-work interface subsystem, hosts four 10-Gbps Ethernet interfaces with PHY devices

through which the network traffic entersthe FPGA. The memory subsystem, for itspart, consists of several QDRII andRLDRAMII components. The majority ofthe I/Os are used for this interface, to max-imize the available off-chip bandwidth forfunctionality such as routing tables orpacket buffering. The FPGA also interfacesto the PCIe subsystem.

The fourth subsystem is an expansioninterface designed to host a daughter card orto communicate with another board. Forthat purpose we brought all remaining high-speed serial I/O connections out to twohigh-speed connectors. Finally, the fifth sub-system is concerned with the configurationof the FPGA.

Overall, the board is implemented as adual-slot, full-size PCIe® adapter. Two slotsare required for reasons of heat/power andheight. Like high-end graphics cards, thisboard needs to be fed with an additionalATX power supply, since power consump-tion—given absolute maximum load onthe FPGA—could exceed the PCIe single-slot allowance of 50 watts. However, theboard can also operate standalone outside aserver environment.

Memory SubsystemA central focus in our design process wasthe interface to SRAM and DRAM com-ponents. Since the total number of I/Os onthe FPGA device limits the overall avail-able off-chip bandwidth, we had to strike acarefully considered compromise to facili-tate as many applications as possible.Trying to support applications rangingfrom network monitoring through securityand routing to traffic management impos-es greatly varying constraints.

In regard to external memory access,for example, a network monitor wouldrequire a large, flow-based statistics tableand most likely a flow classificationlookup. Both accesses would require shortlatencies, as the flow classification needsmore than one lookup with intradepen-dencies, and the update in a flow statisticstable would typically encompass a read-modify-write cycle. Hence, SRAM wouldbe an appropriate device selection.However, a traffic manager would mainlyneed a large amount of memory for pack-et buffering, typically implementedthrough DRAM components due to den-sity requirements. As a final example, con-sider an IPv4 router that needed arouting-table lookup as well as packetbuffering in respect to external memory.

Summing up the requirements from arange of applications, we realized that cer-tain functionality would consume externalmemory bandwidth, whether it be SRAMor DRAM. Where packet buffering (requir-ing a large storage space) would point toDRAM, SRAM would be the choice forflow classification search access, routing-table lookup, flow-based data table for sta-tistics or rule-based firewall, memory


Figure 1 – The NetFPGA-10G board is built around a Virtex-5 FPGA.

NetworkInterface

SubsystemExpansionInterface

PowerSubsystem

PCIeSubsystem

MemorySubsystem

ConfigurationSubsystem

Figure 2 – The board connects to five subsystems: network interface, PCIe, expansion, memory, configuration and power.



management tables for packet bufferimplementations and header queues.

All of this functionality needs to be car-ried out on a per-packet basis. Therefore,given the worst case of minimum-sizepackets of 64 bytes with 20 bytes of over-head, the system needs to service a packetrate of roughly 60 Megapackets per second.Second, we need to differentiate the access-es further. To begin with, many memorycomponents such as QDR SRAM andRLDRAM SIO devices have separate readand write data buses. Since the access pat-terns cannot be assumed to be symmetri-

cally distributed, we cannot pool theoverall access bandwidth, but must consid-er the operations individually.

What’s more, there is a third type ofaccess, namely “searches.” Searches can beideally implemented through TCAM-based devices, which give guaranteedanswers with fixed latencies. However, weruled out this type of device for our boardfor price and power reasons, along withthe fact that TCAMs further constrainthe use of the I/O. Searches can also beimplemented in many other ways such asdecision trees, hash algorithms anddecomposition approaches, with or with-out caching techniques, to name a few[2]. For the purpose of this discussion, weassumed that a search can be approximat-ed through two read accesses on average.Given these facts and making some com-mon assumptions on data widths, we

refined our requirements further as seenin Table 1.

Assuming a clock speed of 300 MHz onthe interface, a QDRII interface can service2*300 = 600 million accesses/second forread and for write operations. Hence, threeQDRII x36-bit interfaces could fulfill all ofour requirements.

In regard to DRAM access, we consid-ered mainly the case of packet storage,where each packet is written and read oncefrom memory. This translates into a dataaccess bandwidth of roughly 62 Gbps onceyou have removed the packet overhead

from the originally incoming 2*40 Gbps.In terms of physical resources, anRLDRAMII access can probably achievean efficiency of around 97 percent, where-as DDR2 or DDR3 devices would morelikely come in at around 40 percent [3],hence requiring significantly more I/O. Wetherefore chose RLDRAMII CIO compo-nents. Two 64-bit RLDRAMII interfaces at300 MHz deliver a combined bandwidththat is roughly enough to service therequirement.

Network InterfaceThe network interface of the NetFPGA-10G consists of four subsystems that can beoperated as 10-Gbps or 1-Gbps Ethernetlinks. To maximize usage of the platform aswell as minimize power consumption, wechose four Small Form-Factor PluggablePlus (SFP+) cages as the physical interface.

Compared with other 10G transceiverstandards such as XENPAK and XFP,SFP+ has significant advantages in termsof power consumption and size. WithSFP+ cages, we can support a number ofinterface standards, including 10GBase-LR, 10GBase-SR, 10GBase-LRM andlow-cost direct-attach SFP+ copper cable(Twinax). Furthermore, SFP modules for1-Gbps operation can be utilized, therebysupporting the 1000Base-T or 1000Base-X physical standards as well.

Each of the four SFP+ cages connects toa NetLogic AEL2005 device via the SFI

interface. The AEL2005 is a 10-GigabitEthernet physical-layer transceiver with anembedded IEEE 802.3aq-compliant elec-trical-dispersion compensation engine.Besides regular 10G mode, the PHY devicecan support Gigabit Ethernet (1G) mode.On the system side, these PHY devicesinterface to the FPGA via a 10-GigabitAttachment Unit Interface (XAUI). Whenoperating in 1G mode, one of the XAUIlanes works as a Serial Gigabit MediaIndependent Interface (SGMII).

Suitable interface IP cores are availableinside the Virtex-5 FPGA. For 10-Gbpsoperation, Xilinx markets the XAUILogiCORE™ IP as well as a 10-GigabitEthernet media-access controller(10GEMAC) LogiCORE IP [4, 5]. For 1Goperation, the interfaces can be directlyconnected to the embedded Xilinx Tri-Mode Ethernet MAC core [6].

Data width #Reads #Writes #Reads #Writes (bits) per packet per packet per x36 interface per x36 interface

Flow classification (5tupel + return key) 144 2 0 8 0

Routing-table lookup (dip + next hop) 96 2 0 6 0

Flow-based information 128 2 0 8 0

Packet buffer memory management 32 2 2 2 2

Header queues 32 1 1 1 1

Total number of accesses 25 3

Total number of accesses per second (MAps) 1,500 180

Table 1 – SRAM bandwidth requirements


PCIe SubsystemThe NetFPGA-10G platform utilizes theOpen Component Portability Infrastructure(OpenCPI) as its main interconnect imple-mentation between the board and the hostcomputer over PCIe [7]. Presently we sup-port the x8 first generation, but mightpotentially upgrade to second generation inthe future. OpenCPI is a generalized, open-source framework for interconnectingblocks of IP that may each use varied typesof communication interfaces (streaming,byte-addressable and so on), with variedbandwidth and latency requirements. In itsessence, OpenCPI is a highly configurableframework that provides the critical com-munication “glue” between IP that is need-ed to realize a new design quickly.

OpenCPI delivers a few key features forthe NetFPGA-10G platform. On the soft-ware side, we are able to provide a cleanDMA interface for transferring data (pri-marily including packets, although not

limited in any way to this type of informa-tion) and for controlling the device via pro-grammed input/output. For networkingapplications, we provide a Linux networkdriver that exports a network interface foreach of the physical Ethernet ports on theNetFPGA device. This allows user-spacesoftware to transfer network packets to thedevice as well as read any host-visible regis-ter in the design.

On the hardware side, OpenCPI pro-vides us with a clean, or even multiple,data-streaming interfaces, each config-urable through the OpenCPI framework.In addition, OpenCPI handles all of thehost-side and hardware-side buffering andPCIe transactions, so that users can focuson their particular application rather thanthe details of device communication.

Expansion Interface, Configuration SubsystemThe purpose behind the expansion sub-system is to allow users to increase port

density by connecting to a secondNetFPGA-10G board—to add differentflavors of network interfaces through anoptical daughter card, for example—or toconnect additional search componentssuch as knowledge-based processors withhigh-speed serial interfaces. We bring out20 GTX transceivers from the FPGA andconnect them through AC-coupled trans-mission lines to two high-speed connec-tors. Designed for transmission interfacessuch as XAUI, PCIe, SATA andInfiniband, these connectors can link toanother board either directly, with matingconnector, or via cable assemblies. Eachtransmission line is tested to operate at6.5 Gbps in each direction, thereby pro-viding an additional 130-Gbps data pathin and out of the FPGA.

Configuring FPGAs with ever-increas-ing bitstream size can potentially be aproblem when the configuration time ofthe device exceeds the PCIe limits. This is


FPGA

PCIe EP with DMA

XAUI

XAUI

XAUI

XGMIIAXI-StreamAXI-LiteCustom

XAUI

MDIOif

10GEMAC

Inputarbiter/outputarbiter

QDR memoryinterface

RLDRAM memoryinterface

clk & rst &pwr with

test

MicroBlazesubsystem

AXI interconnect

Configuration controland bitstream gear box

UARTreport

CPLD (flash controller andconfiguration circuit)

Figure 3 – The FPGA design architecture for 10G operation relies on the AXI protocol.



the case with the V5TX240T. The accessspeed of the platform flash devices, whichis significantly below what the V5TX240Tcan theoretically handle, imposes a bottle-neck. As countermeasures, designers mightconsider configuration of partial bit-streams, as well as accessing platform flashdevices in parallel and configuring theFPGA at maximum speed. To facilitatethe latter possibility, we equipped theboard with two platform flash devices thatconnect through a CPLD to the FPGA’sconfiguration interface. In addition, theboard also supports the standard JTAGprogramming interface.

The FPGA DesignA key aspect behind a successful open-source hardware repository must be a cleanarchitectural specification with standard-ized, abstracted and well-defined interfaces.In fact, we believe these interfaces to becrucial to providing a building-block sys-tem that enables easy assembly of compo-nents contributed from a large globalcommunity. Standardization ensures physi-cal compatibility, and a rigorous definitionthat goes beyond physical connectivityreduces the potential for misunderstandingon the interface protocols. Ideally, users canthen deploy abstracted components with-out further knowledge of their intrinsicimplementation details.

As part of the full platform release, wewill provide the complete architecturalspecification as well as a reference designwith key components. These key compo-nents include:

• PCIe endpoint from OpenCPI [4]

• Four XAUI LogiCORE IP blocks andfour 10GEMAC LogiCORE cores(the latter under licensing agreement)in 10G mode [7,8]

• Two RLDRAMII memory controllersbased on the XAPP852 [8]

• Three QDRII memory controllersbased on MIG3.1 [9]

• Clocking and reset block that checksclock frequencies and power-OK signals,and generates all required system clocks

• Tri-Mode Ethernet MAC (TEMAC)for 1G operation [9]

• MDIO interface for configuration ofthe PHY devices

• Control processor in the form of aMicroBlaze®, with a support systemthat handles all administrative tasks

• UART interface

• Configuration control and bitstreamgear box, which provides a program-ming path to the platform flash devices

• Input arbiter that aggregates allincoming traffic into one data stream

• Output arbiter that distributes the traffic to the requested output port

Figure 3 illustrates how these compo-nents are interconnected for 10G opera-tion. For data transport we chose theAMBA®4 AXI streaming protocol, and forcontrol traffic the interface is based onAMBA4 AXI-Lite. You can find furtherdetails on these interface specifications onwww.arm.com (and on the Xilinx webpagein the near future).

Status and OutlookDesign verification of the board is com-plete. Once production test development isfinished, the board should shortly bereleased for manufacturing. Initial FPGAdesigns are running and we are workingnow toward a first code release.

You can order through HiTech Global,which is manufacturing and distributingthe board (http://www.hitechglobal.com/Boards/PCIExpress_SFP+.htm). Differentpricing models apply to academic andcommercial customers. For the most up-to-date information on the project, please visitour website, www.netfpga.org, or subscribeto the netfpga-announce mailing list:https://mailman.stanford.edu/mailman/listinfo/netfpga-announce.

The first code releases, planned for thecoming months, will open up the sourcecode to the wider community. It basicallycontains the design detailed in the FPGAdesign section and implements the intro-duced architecture. In addition, we will put

significant effort into providing the righttype of infrastructure for building, simulat-ing and debugging the design. A last focuspoint in the coming months will be therepositories and framework that enable theefficient sharing of experience, expertise andIP within the NetFPGA community.

AcknowledgementsMany thanks to the following people for theirhelp with this project: Shep Siegel, CTO ofAtomic Rules and driving force behindOpenCPI; John Lockwood, CEO of Algo-LogicSystems, driving force behind the first genera-tion of NetFPGA and an active contributor tothe current platform; Paul Hartke from XilinxUniversity Program (XUP) for his endless ener-gy and support; Rick Ballantyne, senior staffengineer at Xilinx, for his invaluable advice onboard design; Adam Covington and Glen Gibbat Stanford for their support; Fred Cohen fromHiTech Global for his cooperation.

Furthermore, we would like to thank theNational Science Foundation and the followingsponsors for donating their components, namelyMicron, Cypress Semiconductor and NetLogicMicrosystems.

References

[1] XC5VTX240T data sheet: http://www.xilinx.com/support/documentation/data_sheets/ds202.pdf

[2] David E. Taylor, “Survey and Taxonomy of Packet Classification Techniques,” ACMComputing Surveys (CSUR), volume 37, No. 3,September 2005

[3] Bob Wheeler and Jag Bolaria, Linley Report:“A Guide to Network Processors,” 11th Edition,April 2010

[4] XAUI data sheet: http://www.xilinx.com/products/ipcenter/XAUI.htm

[5] 10GEMAC data sheet:http://www.xilinx.com/products/ipcenter/DO-DI-10GEMAC.htm

[6] Tri-Mode Ethernet MAChttp://www.xilinx.com/support/documentation/ip_documentation/hard_temac.pdf

[7] www.opencpi.org

[8] XAPP852 (v2.3) May 14, 2008:http://www.xilinx.com/support/documentation/application_notes/xapp852.pdf

[9] MIG data sheet: http://www.xilinx.com/products/ipcenter/MIG.htm


by Fabrice MullerAssociate ProfessorUniversity of Nice Sophia, Antipolis, France – LEAT - [email protected]

Jimmy Le RhunResearch [email protected]

Fabrice LemonnierProject [email protected]

Benoît MiramondAssociate ProfessorETIS, UMR 8051 CNRS - ENSEA - [email protected]

Ludovic DevauxPhDUniversity of Rennes 1/ Cairn Inria [email protected]

An autonomous robot finding its way inunknown terrain; a video decoder changingits decompression format according to sig-nal strength; a broadband electronic coun-termeasure system; an adaptiveimage-tracking algorithm for automotiveapplications. These are among the manyemerging embedded or mission-criticalapplications that show dynamic behaviorwith strong dependency on an unpre-dictable environment. Where staticallyresolved worst-case allocation was once ananswer to strong real-time constraints, flex-ibility is now also a requirement. The solu-tion proposed in a French research project isan operating system distributed onto theFPGA resources that manages both hard-ware and software threads.


A Flexible Operating System for Dynamic ApplicationsA Flexible Operating System for Dynamic ApplicationsBy using the dynamic reconfiguration capabilities of Xilinx FPGAs, software flexibility and hardware performance meet in a homogeneous multithread execution model.

By using the dynamic reconfiguration capabilities of Xilinx FPGAs, software flexibility and hardware performance meet in a homogeneous multithread execution model.

XCEL LENCE IN NEW APPL ICAT IONS

In recent years, denser systems-on-chip(SoCs) have made it possible to meet thedesign constraints by parallelizing tasks anddata, thereby increasing the number of com-putation units, in particular in embeddedsystems. This trend continues today with theaddition of heterogeneity of computingcores. But this technique becomes a wall ofcomplexity which dictates a higher level ofabstraction in the programming model.

To surmount these challenges, we pro-pose to define a unified execution modelthat is used whether the threads aremapped onto hardware or software. Thehardware implementation of this executionmodel relies strongly on the use of dynam-ically reconfigurable logic. Coupled with atraditional multicore software subsystem, afully distributed architecture provides thebest of both worlds. The software part iswell-suited to intelligent event-based con-trol and decision making, while the hard-ware part excels in power efficiency, highthroughput and number crunching. Inbetween, we get the ability to balance per-formance and resource utilization, not justfor each specific application, but also for aspecific state of an application.

The high integration capabilities of mod-ern, platform FPGAs make it possible to puta full, heterogeneous and dynamic comput-ing system into one or two chips, with ahigh level of flexibility and scalability.

The adaptive hardware is very useful inapplications such as missile electronics andsoftware-defined radio, where power con-sumption and system size are limited, andwhich must be highly reactive to the envi-ronment. Thanks to dynamic reconfigura-tion, you can implement dedicatedarchitectures for the different applicationmodes without increasing the power con-sumption of your system or the size of theboard. The classical solutions centralize the

you can deploy the application threads insoftware (processors) or in hardware(reconfigurable units), either statically ordynamically, with indifferent access to thedistributed services.

We have implemented operating-sys-tem services in hardware, next to thereconfigurable zones, to reach a high levelof efficiency. We have implemented acommunication layer between heteroge-neous OS kernels to ensure the homo-geneity of the services from theapplication’s point of view. So, deployingthe OS on the architecture as a large num-ber of modules and execution units takesadvantage of the virtualization mecha-nisms that allow the application threads torun and communicate without a prioriknowledge of assignment.

From the programmer’s point of view,the application is just a set of threads. Weuse the dynamic reconfiguration capabilitiesof Xilinx® FPGAs to propose the new con-cept of hardware threads, along with animplementation of this concept with thesame behavior as software threads. Ourimplementation takes advantage of the per-formance of dedicated computing IP blocks.

Along with the execution units presentin the multiprocessor SoC, the memoryorganization must meet several require-ments: the storage of the data needed by theapplication threads, the storage of the exe-cution context of each thread and theexchange of data between threads.Regarding the storage of execution contexts,we envisage several possibilities. One ofthese is to centralize the execution contextsand thus provide a medium for their distri-bution to different execution units. Weidentify three types of communicationstreams within the platform: applicationdata, control signals and configuration/exe-cution contexts. The high-bandwidth data

control and do not, today, seem able toaddress effectively both the number of exe-cution units and their heterogeneity. Onlya distributed approach that is both flexibleand scalable can take up this challenge ofcreating a future-proof architecture.

Despite the potential of this technology,the use of dynamic reconfiguration is stilla challenge for the industry. Engineersneed a clear design methodology to getthe full benefit of dynamic reconfigura-tion, without impacting on applicationdescription, and, above all, with noincrease in development costs. In anattempt to combine dynamicity and per-formance, we propose to abstract the het-erogeneity by means of an executionmodel based on multithreading. Thedeveloper will be able to program theirapplication as a set of threads, irrespectiveof whether the threads are executed on astandard processor or on dedicated hard-ware. In this context, dynamic reconfigura-tion becomes a method for threadpreemption and context switching. TheFOSFOR project (Flexible OperatingSystem for Reconfigurable platform), spon-sored by the French National ResearchAgency (ANR), is dedicated to developingthis new generation of embedded, distrib-uted real-time operating systems.

FOSFOR Architecture BasicsOur goal is to define an architecture thatsupports a new kind of system partitioning,where hardware and software componentsfollow the same execution model. Thisrequires a very flexible and scalable operat-ing system, one that provides similar inter-faces to the software and hardwaredomains. Unlike classical approaches, thisOS can be completely distributed, with thewhole platform being homogeneous fromthe application’s point of view. This means



Our goal is to define an architecture that supports a new kind of system parti t ioning, where hardware and sof tware

components fol low the same execution model. This requires a very f lexible and scalable operating system.

paths between hardware threads use a dedi-cated network-on-chip (NoC).

The Global ArchitectureThe global architecture is depicted inFigure 1 and consists of:

• A set of nonspecialized, or general-pur-pose, processors (GPPs). A GPP sup-ports the execution of software threads.It also supports a set of OS services,one of which schedules the threads.The GPPs are not necessarily homoge-neous in terms of instruction-set archi-tecture and number of offered services.

• A set of dynamically reconfigurablezones (called the reconfigurable region,RR), which are in charge of the simul-taneous or sequential execution of a setof hardware threads. Like a GPP, an RRsupports the execution of OS servicesthanks to a hardware OS (HwOS).These regions correspond to fine-grained (FPGA) or coarse-grained(reconfigurable processor) architectures.

• Virtual channels of communication—control, data and configuration—thatcan share one or more physical com-munication channels. The controlchannels allow communication

between OS services distributed ontothe execution units (GPP and RR).The data channels convey informationrelated to the environment (devices,sensors) and the exchange between thethreads. The configuration channelsallow the transfer of the configurationsof the software threads (binary code)and hardware threads (partial bit-streams) between the configurationmemory and the execution units.

Each processor has its own local mem-ory. This memory stores the local dataand, where applicable, the software code.A shared memory, connected to the datachannel, allows data sharing betweenthreads on different processors. Each exe-cution unit has access to shared memory,containing the data and the programs ofsoftware execution resources. Eachresource also has access to a configurationmemory, to save and restore its executioncontext. This structure makes it possibleto implement any thread or service on anyexecution resource.

Within the RR, only the hardware tasksneed to be dynamically reconfigured. Thedynamic region (DR), the area hosting thetasks, is surrounded by the static region(SR), which contains hardware implemen-

tation of OS services, along with communi-cation media both internal and external tothe RR. Internal data-stream communica-tions rely on a dedicated network-on-chip.The interfaces between DR and SR use abus macro, and have a fixed location. Toabstract this constraint, and the heterogene-ity of the communication medium, we pro-pose a middleware approach that providesvirtualized access to the reconfigurable zone.An RR is organized according to the modeldefined in Figure 2. The FOSFOR proto-type platform consists of dynamically recon-figurable FPGA devices that can alreadysupport this architecture model. We select-ed Virtex®-5 devices for their ability toreconfigure rectangular regions.

We have defined a scheduling/place-ment algorithm that ensures efficient use ofthe FPGA elements (LUTs, registers, dis-tributed memory, I/O) inside each RRaccording to the precomputed resourceneeds of the application threads.

OS, NoC and MiddlewareTo be flexible, the FOSFOR architectureuses at least two instances of operating sys-tems: a software OS that runs on eachprocessor and handles software threads, anda hardware OS that is able to manage hard-ware threads. To correctly balance the


ThreadThread

ThreadThreadApplication

Middleware(virtualization, distribution, flexibility)

OS1(X services)

OSn(Y services)

Hardware Abstraction Layer (HAL)

Software Communication Unit Hardware Communication Unit

Software Nodes(GPP)

Hardware Nodes(Reconfigurable Regions)

SharedMemory

NoC

Har

dwar

eS

oftw

are

Fle

xibl

e O

S

Figure 1 – Generic FOSFOR architecture



trade-offs between performance, develop-ment time and standardization, we used anexisting operating system for the softwarepart, but we designed the hardware operat-ing system from scratch.

The requirements for the software OSare real-time behavior, the ability to han-dle multiple processors and the availabili-ty of basic interprocess communicationservices. We chose a free, open-source OS,namely RTEMS (Real-Time Executive for

Multiprocessor Systems; see http://www.rtems.org/ ). For compatibility reasons, weselected the LEON Sparc soft-core proces-sor, which is also free and open-source, asthe software node.

Thanks to partial dynamic reconfigura-tion provided by Xilinx FPGAs, the HwOScan schedule hardware threads with asmuch flexibility as a classical operating sys-tem would with software threads. A hard-ware thread consists of two parts, onedynamic and one static. The dynamic partincludes an IP block that implements thethread functionality and a finite state

machine to synchronize the service callsequence with the HwOS. The static partcontains a control interface with theHwOS, and a network interface for dataexchange with the other tasks, both hard-ware and software.

To support diverse interthread datatransfer needs, we have developed a flex-ible network-on-chip called DRAFT.Communication services of a classicaloperating system are sufficient to support

communications between softwarethreads. However, in our case, the OSalso needs to support communicationsbetween hardware threads. We specifical-ly designed the DRAFT network for thispurpose. We synthesize each hardwarethread for one or more DRs, and stati-cally define each DR interface.

The static definition of the communi-cation interface allows us to define a staticnetwork-on-chip. In general, the hard-ware threads need large bandwidth andlow latency, so the NoC must providehigh performance. The topology we chose

for DRAFT is an extension of the Fat-Tree topology. The main objective of ourdesign is to limit resource overhead whileallowing high-performance interthreadcommunication.

The heterogeneity of the hardware plat-form is a major factor of complexity whendesigners deploy their applications. In theFOSFOR project, this heterogeneity comesnot only from the different embeddedprocessors in the software domain, but

above all from the presence of both soft-ware and hardware computation models ina single platform.

The middleware solves this problem byproposing an abstraction layer betweenhardware and software and providing ahomogeneous programming model. Itimplements a set of virtual channels thatallow communication between threads,independently of their implementationdomain. These services are distributedacross the platform and thus offer a scalableand flexible abstraction layer that com-pletes the FOSFOR concept.

Static Region Reconfigurable Region Static Region

Control

Context(bitstream)

Data

Hardware Zone

HwOS NoC

control data

DynamicRegion

Thread

Figure 2 – Reconfigurable region structure


Thanks to partial dynamic reconfiguration provided by Xilinx FPGAs, the hardware OS can schedule hardware threads with as much flexibility

as a classical operating system would with software threads.

Performance Speedup The main reasons for building a hardwareOS are performance and flexibility. The OScould have been fully software or fully hard-ware. Since each call to an OS primitive

involves overhead and means a waiting timefor the thread, the faster the OS is, the lesstime wasted. To evaluate this overhead, wemust compare the HwOS timings with theoriginal software OS, RTEMS.

Hardware local operations need only afew tens of cycles, while hardware globaloperations require a few hundred cyclesdue to shared memory access. We haveevaluated a speedup of about 60 times onlocal creation-and-deletion operations,and about 50 times for other operations,compared with results from the softwareoperating system.

The resource usage of the HwOS (Table1) varies greatly, depending upon the num-ber and the capabilities of activated services.For example, we select the number ofobjects (semaphores, threads and so on) foreach service. We use a Xilinx Virtex-5FX100T to implement the system. Thetable lists the resources that the HwOS uses.The remaining space can be used to imple-ment other system components and thehardware threads themselves.

Concerning network performance for aconfiguration where DRAFT connectseight elements with a word width of 32 bits,a buffer depth of four words and a 100-MHz frequency, the network-on-chip sup-ports a data rate of up to 1,040Mbits/second per connected element. Thetopology and the routing protocol of thenetwork guarantee that no contention andno congestion can occur. At least one com-

munication path is always available betweentwo connected elements. Data travelsthrough DRAFT with an average latencynear 45 clock cycles (450 nanoseconds),which is compliant with many applications.

Looking AheadWe have proposed an innovative OS thatprovides a homogeneous execution modelbased on multithreading, on a heteroge-neous multicore architecture composed of

processors and dynamically reconfigurablehardware IP blocks. A hardware OS man-ages the hardware threads, typically forthread creation and suppression but also forsemaphore and message queue services. Interms of communications, we have pro-posed improvements to the Fat-Tree NoCfor data exchanges, a dedicated bus for hard-ware thread management and a communi-cation layer for inter-OS synchronization.

The next step, from an industrial pointof view, is to demonstrate that the hardwarefunctions, which were added to ensurehomogeneity of the execution model, leadto a real improvement in programming effi-ciency while keeping a low performanceoverhead on the dedicated IP blocks.

We will demonstrate our approach on arepresentative Thales application, based ona search-and-track algorithm. The trackingthreads are mapped onto reconfigurablezones and are created dynamically, depend-ing on the target detection.


Number of implemented structures 8 16 32

CLB slices 2,408 (15%) 3,151 (20%) 4,327 (27%)

D flip-flops 5,498 (8.5%) 6,650 (10.4%) 8,918 (13.9%)

BRAMs 8 (3.5%) 16 (7%) 32 (14%)

Table 1 – Resource usage of the HwOS (Virtex-5 FX100)


Time to market, engineering expense, complex fabrication, and board trouble-shooting all point to a ‘buy’ instead of ‘build’ decision. Proven FPGA boards from theDini Group will provide you with a hardware solution that works — on time, andunder budget. For eight generations of FPGAs we have created the biggest, fastest,and most versatile prototyping boards. You can see the benefits of this experience inour latest Vir tex-6 Prototyping Platform.

We started with seven of the newest, most powerful FPGAs for 37 Million ASICGates on a single board. We hooked them up with FPGA to FPGA busses that runat 650 MHz (1.3 Gb/s in DDR mode) and made sure that 100% of the boardresources are dedicated to your application. A Marvell MV78200 with Dual ARMCPUs provides any high speed interface you might want, and after FPGA configu-ration, these 1 GHz floating point processors are available for your use.

Stuffing options for this board are extensive. Useful configurations start below$25,000. You can spend six months plus building a board to your exact specifications,or start now with the board you need to get your design running at speed. Best ofall, you can troubleshoot your design, not the board. Buy your prototyping hardware,and we will save you time and money.

www.dinigroup.com • 7469 Draper Avenue • La Jolla, CA 92037 • (858) 454-3419 • e-mail: [email protected]

DN2076K10 ASIC Prototyping Platform

Why build your own ASIC prototyping hardware?

Daughter Cards

Custom, or:

FMC

LCDDRIVER

ARM TILE

SODIMMR FS

FLASHSOCKET

ECT

Intercon

OBS

MICTOR DIFF

DVI

V5T

V5TPCIE

S2GX

AD-DA

USB30 USB20

PCIE SATA

Custom, or:

RLDRAM-IISSRAM

MICTOR

QUADMIC

INTERCON

FLASH

DDR1

DDR2

DDR3

SDR

SE

QDR

RLDRAM

USB

12V Power

JTAG Mictor SATA

SATA

Seven50A Supplies

10/100/1000Ethernet

SMAs forHigh Speed

Serial

SODIMM Sockets

PCIe 2.0

PCIe to MarvellCPU

PCIe 2.0

USB 2.0

USB 3.0

10/100/1000Ethernet

10 GbEthernet

GTXExpansionHeader

3rd PartyDeBug

TestConnector

12V Power

All the gates and features you need are — off the shelf.

SODIMM SODIMM

by Christopher Fenton

The year was 1976. Disco was still popular,the Cold War was in full swing and Iwouldn’t even be born for another nineyears when the Cray-1 burst onto the com-puting scene. Personal computing was bare-ly in its infancy (the MITS Altair had beenintroduced a year earlier) at the time, andcompanies like Control Data Corp. andIBM dominated the high end. The Cray-1was one of those legendary machines thathelped define the term “supercomputer” inthe public imagination. Its iconic C-shapestructure housed a fire-breathing machinerunning at 80 MHz—something desktopswouldn’t reach until almost two decadeslater. The Cray had speed. It had style.

Now let’s fast-forward 33 years, to themorning in early 2009 when I woke up andjust decided I wanted to own one.

I first got into FPGA-based retro-com-puting, something I lovingly refer to as“computational necromancy,” shortly aftergraduating from the University of SouthernCalifornia with a BSEE in December 2007.As a newly minted electrical engineer and all-around fan of arcane computer architectures,I saw this pursuit as the perfect excuse toimprove my Verilog skills. Starting with aDigilent Spartan®-3E 1200 board that Ibought myself as a graduation present, myfirst machine was another abandoned relic ofthe 1980s, the NonVon-1. This was one ofthe first “massively parallel” machines, simi-lar to the more successful ConnectionMachine series of the same vintage, althoughgeared more toward databases. It was a won-derfully odd machine, composed of a binarytree of 8-bit processors (with 1-bit ALUs).


Resurrecting the Cray-1 in a Xilinx FPGAWant a classic, mammoth, number-crunchingsupercomputer of your own? Build one.

XPER IMENT

After a few months of tinkering, Ieventually found myself the proud ownerof a 31-node supercomputer dwarfed incomputing power by any modern wrist-watch. As useless as it was, however, themachine made me realize just how farMoore’s Law has brought us. And it whet-ted my appetite for more.

After my success with the NonVon-1, Iwas casting around for a new project (andmy Verilog skills were still a bit lacking). Irealized that low-end FPGAs had grown tothe point that they could handle some pret-ty serious hardware—even 32-bit “soft”processors are fairly common these days.Searching about for a new target to try torevive, I considered a few—the UNIVAC isan interesting machine, but it’s a bit too oldfor me. Digital Equipment Corp.’s PDPseries has been emulated before. Simulatorsfor Z80 machines are commonplace. That’swhere the Cray comes in.

What is the Cray-1? The Cray-1 was Seymour Cray’s firstmachine after splitting off from ControlData and founding his own company, CrayResearch, in the early 1970s. It was a ruth-less number cruncher that required a roomfull of computers and disks to keep it fedwith data. It also had a full-time staff ofengineers just to keep it running, and near-ly required its own power plant just to bootup. This is a machine that redefined theterm “supercomputer” (I mean, it’s aCray)—and, fortunately, it’s also beautiful-ly simple in its design. Thankfully, it’sincredibly well-documented too (Figure 1).The Cray-1 Hardware Reference Manuals(readily available on the Internet) go into alevel of detail that’s almost shocking tomodern-day readers used to being handedblack boxes. Nearly every op-code, registerand timing diagram is documented inexquisite detail.

store instructions) or between two operandregisters and a destination register (allarithmetic/logic instructions). Instructionsare either 16 or 32 bits long. The machineuses three different types of registers:address, scalar and vector registers. The

The computer itself is a 64-bit,pipelined processor with in-order instruc-tion issue and a mere 128 unique instruc-tions. It has a very RISC-like instructionset, with all instructions being eitherbetween memory and registers (load or


Figure 1 – Fortunately for hobbyists, the Cray architecture is beautifully simple in its design and very well-documented.

XPER IMENT

The Cray-1 was one of those legendary machines that helped define the term ‘supercomputer’ in the public imagination. Its iconic C-shape structure

housed a fire-breathing machine running at 80 MHz—something desktopswouldn’t reach until almost two decades later.

address registers are 24 bits wide and let themachine address up to 4 Megawords (32Mbytes) of main memory. The scalar regis-ters, which are 64 bits wide, are used forcomputation. Each vector register containssixty-four 64-bit registers, giving themachine great performance when doingscientific calculations on large matrices.

Inside the CPU, instructions can beissued to 13 independent, fully pipelined“functional units.” Heavy pipelining was

crucial to achieving the Cray’s insanelyhigh (for the time) 80-MHz clock fre-quency. Separate functional units handlelogical operations, shifting, multiplicationand so on. A floating-point multiplyinstruction, for instance, takes seven cyclesto complete, but the computer can issue anew multiply instruction on every cycle(assuming no register conflicts exist). Aninteresting consequence of this design isthat there is no “divide” instruction.Instead, the machine uses “division byreciprocal approximation.” Rather thancomputing X / Y, you compute (1 / Y) * X.A separate floating-point “reciprocalapproximation” functional unit can calcu-late a reciprocal in 14 clock periods.

The MarathonWhen I first began working on this project,I still hadn’t convinced myself it would bepossible to re-create such a sophisticatedcomputing machine by myself. The origi-nal Cray-1 took a whole team of peopleyears to design and build. Was I motivatedenough to stick with it? (As it turned out,

yes, I was.) Was my FPGA big enough toactually fit it? (As it turned out, no, it wasnot.) Even if the design is fairly straight-forward, it’s still a large design (currently~5,600 lines of Verilog and counting). Ijust had to get myself into the right mind-set. Building your own supercomputer is amarathon, not a sprint. I could only hopeto accomplish it one step at a time.

I started, one by one, with creating thefunctional units. Like building your own

hot rod, building a complete computer getsyou acquainted with every aspect of adesign in a way you would rarely experi-ence otherwise. I explored multiplier andadder design. I reopened textbooks onfloating-point arithmetic. I learned how touse three iterations of the Newton-Raphson method to compute a reciprocalapproximation to 30 bits of accuracy (did Imention how detailed the hardware refer-ence manual is?).

One by one, the functional units tookshape. This was a strictly “free-time” proj-ect, so progress came in fits and starts. Istarted with the easiest blocks first, and fin-ished the two address functional units (asimple adder and a multiplier) withoutmuch difficulty. My momentum started tofalter as I tackled the scalar functional units(an adder, a logical unit, a shifter and apopulation/leading zero count). I hit a lowpoint in my motivation as I fiddled withthe three floating-point functional units(an adder, a multiplier and the infamousreciprocal approximation unit). As I said,this was a marathon, not a sprint. I started

working on the Cray-1 in early 2009 andprobably spent 19 to 20 months total on it.

I started to get my second wind towardthe end of the floating-point units’ design,and regained steam as I moved on to thevector units. As I mentioned earlier, theCray-1 was designed as a number-crunch-ing behemoth. It has eight vector registers,each of which holds sixty-four 64-bit regis-ters. When a vector instruction executes,say an addition operation, one entry fromeach operand will be added and stored in athird (result) vector on every cycle.

An awesome feature that the Cray-1supports is called “vector chaining.” Thevector add unit, for instance, only takesthree cycles to generate the first result. Ifwe’re adding two 64-entry vectors together,however, we don’t want to wait for all 64entries to finish adding before we do some-thing with the result. Vector chainingallows us to “chain” the result coming outof the adder unit straight into the input ofanother unit, without waiting for the oper-ation to finish. We can start multiplyingthe result with a third vector two cyclesafter the first result is available. For somelarge matrix calculations, you could almostsustain two floating-point operations perclock cycle—at 80 MHz, that’s a peak rateof 160 MFLOPS! Common desktop com-puters didn’t catch up to the Cray-1 untilthe mid-1990s.

With the functional units in place, Icould almost see the light at the end of thetunnel. Surely it was just a matter of addingin a bit of glue logic and being done with it,right? Well, close. It turns out there’s a lotof glue logic. Even though the Cray-1 iswell-documented, it’s not that well-docu-mented. I knew exactly what every instruc-tion was supposed to do, but I got stuckreverse-engineering minor (and not-so-minor) details like instruction issuing, haz-ard detection and vector chaining. Somethings, like big 64-bit data buses, are prob-ably easier to build with discrete logic chipsthan with FPGAs designed for narrowerdatapaths. The vector registers gave me arouting headache.

A few features I also had to fudge. TheCray-1 had a 16-bank all-SRAM memorysystem the size of my refrigerator that could


XPER IMENT

Building your own supercomputer is a marathon, not a sprint.

I could only hope to accomplish it one step at a time. I started, one by one,

with the functional units, creating the easiest blocks first.


sustain 640 Mbytes/second of bandwidth(one 64-bit word per cycle at 80 MHz) toits 4 Megawords of memory, something themeasly DDR memory chip on my develop-ment kit could never approach. I woundup using nearly all of my FPGA’s on-dieBlock RAM to scrounge together a mere 4kilowords of memory space, by far myCray’s biggest limitation at the moment.And I had to leave a few features out alto-gether: DMA-style I/O channels designedto communicate with disk drives and“host” minicomputers, and rapid context-switching support. These might make itback in once I get a useful amount of mem-ory and a bit of software for the machine.

Hardware Obstacles I should take a few words to make note ofsome of the FPGA-related challenges I raninto along the way. First of all, my originalSpartan-3E 1200 chip didn’t prove to be upto the challenge. As soon as I added theCray’s mammoth vector registers to mydesign, I overflowed the chip’s meager logicresources. I started to sweat a bit at thispoint—I was already more than a year intothis project, and the price tag of biggerFPGAs seems to go up exponentially withsize. The larger Virtex® chips could handlemy design with ease, but developmentboards typically exceed a few years’ worthof my hobby budget. Fortunately, Digilentalso sells a slightly larger Spartan-3E 1600board for a still-hobbyist-friendly price. Itproved to be just large enough to fit thewhole system (and significantly boosted theamount of Block RAM I had to play with).

The other problem I ran into was sim-ply one of speed. Despite 30 odd years ofMoore’s Law, the Cray-1’s original 80-MHz design proved to be too much formy poor Spartan-3E. My initial designtopped out at about 33 MHz, with thecritical path running through cascadedadders I used in my rather naive imple-mentation of a floating-point multiplier.Fortunately, the Spartan-3E is equippedwith a number of 18-bit hardware multi-pliers that were able to boost the speed upto nearly 50 MHz (and shave off 5 percentof the area), but at that point the rest ofthe design starts to run into a wall. The

Cray-1’s spaghetti bowl of 64-bit data-paths and complicated instruction-issuelogic limit the cycle time to 20 nanosec-onds or so on the Spartan-3E. For nowI’m satisfied, but maybe one of the newerSpartan-6 chips will be up to the challengeof hitting the magical 8-0 mark.

Case ConstructionWith the hardware in mostly working order,I moved on to the fun part—the case. Sure,the circuit design is the interesting part forany engineer, but what fun is owning yourown Cray-1 if it doesn't look like a Cray-1?It also gave my friend Pat the perfect projectto try out his new CNC milling machine. Afew trips to Home Depot and a busy

Saturday in Pat’s garage (see Figure 2) leftme with something that was definitely start-ing to look like a miniature Cray. The squareshape of the development board meant I hadto get a bit creative with the base (squaresand C-shapes don’t play nicely together),but the board’s compact dimensions conve-niently meant that my design worked out tobe almost exactly 1/10 scale.

Another few weeks of sanding, paintingand polishing followed. Finally, a trip to thelocal fabric store let me finish off the designwith a tiny replica of the Cray-1’s distinctive

built-in “pleather” couch. Finally, Mattel’snew Computer Engineer Barbie had some-where to rest her weary feet!

The SoftwareWith the CPU in mostly working shapeand a matching case all ready to go, I wasprepared to declare victory. I was flawlesslyexecuting simple loops and programs of afew instructions. But what about real soft-ware? One of the unfortunate things aboutmy shiny new Cray-1 was its total lack ofsoftware, and a computer without softwareis only marginally more useful than thesand it’s built from. The Cray-1 unfortu-nately suffers from the double-whammy ofexisting in a pre-Internet world and being

sold primarily to government agencies withscary-sounding acronyms. I spent monthstrawling the depths of the Internet lookingfor software, but came up dry. I e-mailedpeople at some of our national labs. I evenfiled a Freedom of Information Act requestwith the National Nuclear Security Agency(government watch-list, here I come), butstruck out there too.

Seeing that I needed a little more help, Ifinally decided to enlist the citizens of theInternet. I added a page on my micro-Cray(and adorable 1/10-scale case) to my web-

Figure 2 – Putting together the iconic C-shaped case took some finessing and the help of a handy friend.

XPER IMENT

site, told a few of my friends with Twitteraccounts about the project and let theInternet do its thing. Within a few days thestory had been posted all over a number ofnews aggregator sites, flooding my in-boxwith e-mails from all over the world. Manywere nostalgic messages from former“Crayons,” sympathetic with my plight.Gradually, however, a few people surfacedwith decades-old source code on stacks ofpaper, nine-track tapes and even a reel ofmicrofiche. Someone even e-mailed me a20-year old PhD thesis, including theirsource code for a Cray-compatible compil-er for an obscure programming language(written, of course, in that same obscureprogramming language). It was the ulti-

mate moment of pack-rat vindication forpeople storing obsolete software and docu-mentation “just in case.”

The FutureWhat does the future hold for my minimainframe (Figure 3)? I’d still like to ironout some more of the machine’s bugs andfinish up the missing features. I haven’tcompletely solved my software problem,but things are starting to improve. Onehelpful netizen dug up a copy of a DOS-based Cray simulator from the late 1980sthat also doubles as a simple assembler (andstill runs perfectly under Windows 7).Programming in Cray Assembly Languageis a dream compared with using straight


octal machine code. If I can overcome thechallenges of obsolete media, it looks like Imight also be able to get my hands on thesource code for at least one operating sys-tem as well as a real Fortran compiler.

My obscure hobby also caught the eye ofsomeone at the Computer History Museum(http://www.computerhistory.org/), so maybemy tiny Cray will retire to Mountain View,Calif., someday. In the meantime, though,I’m already starting to think about what mynext project could be.

Figure 3 – The final micro mainframe, built on a Spartan-3E.

C h r i s t o p h e rFenton, an elec-trical engineerliving in NewYork City, is anactive memberof his localHackerspace ,NYCResistor.He is an avid

historical-computing enthusiast andenjoys building impractical andunnecessarily complicated projects inhis free time. Details on the Cray-1and other projects are available on hiswebsite, www.chrisfenton.com.

XPER IMENT

by John McCaskill President Faster [email protected]

David LautzenheiserBusiness DevelopmentFaster [email protected]

Recognizing that it is no longer sufficientto just build an ever-larger FPGA, Xilinxhas put the spotlight on the partial-recon-figuration capabilities of its devices as apowerful weapon in the competitive land-scape, and more importantly has dedicatedthe resources and support to make partialreconfiguration an equally powerful capa-bility for users. The tools and methodologyare now in place to enable a much broaderset of design teams to take advantage ofpartial reconfiguration than the avid earlyadopters of the past. This design techniqueis truly ready for prime time.

Xilinx’s main-line FPGAs, by the verynature of the static-memory cells used tohold configuration information, havealways been reconfigurable. In the earliestdays, designers viewed this shape-shiftingas a big headache because of the noveltyof logic that “forgot” its function whenthe power was turned off. The initial neg-ative perception was soon overcome asdesigners and their management realizedthe ultimate power of a logic device thatcould be reconfigured, enabling endlessengineering experimentation, last-minutedesign changes or fixes and—to a fewearly visionaries—the idea that a systemcould be conceived of with multiple per-sonalities. Thus was born the notion ofpartial reconfiguration.

In early devices, there was such a largetransistor overhead burden of static mem-ory to hold configuration data that theprogramming mechanisms had to be assimple and “lean” as possible. To keep

things simple, early FPGA families couldonly be configured or reconfigured as awhole. Even in this restricted environ-ment, some clever users decoded bit-streams and figured out how to changeLUT functions and a few other things inattempts to allow some on-the-fly changesto devices in their systems—a very earlyform and use of reconfiguration on the fly.

As Moore’s Law progressed, an ever-increasing number of transistors becameavailable for FPGA architects to considerusing in new and novel ways. Early archi-tectural experiments with completely ran-dom access to configuration data, such asthe Xilinx® 6200 family, or with multipleconfiguration memory planes, such as thePrism project, tested what might be possi-ble if some of this flood of transistors wasapplied to configuration schemes beyondjust full-chip programming.


FPGA Partial Reconfiguration Goes MainstreamFPGA Partial Reconfiguration Goes MainstreamThis PR primer shows that techniques to reconfigure portions of a device on the fly are well within the reach of any design group.

This PR primer shows that techniques to reconfigure portions of a device on the fly are well within the reach of any design group.

XPERTS CORNER

At the same time, design methods shift-ed from logic captured in schematics tofunctionality detailed in hardware descrip-tion languages. The move to HDL designenabled a design team to conceive andimplement more complex designs, but alsoremoved them from many of the intricatelow-level details of early FPGA design. Thisshift enabled teams to consider higher-levelsystem functionality and alternative solu-tions more rapidly than previously possible.

All of the experiments and best thinkingto that date went into the first Virtex® fam-ily. At this point, it became possible to puthardware capabilities into the device archi-tecture that the tools flow could not imme-diately support. The hope, or perhapsdream at the time, was that tools wouldrapidly improve to take advantage of newand novel architectural capabilities. Whileit was a noble goal, designers never reallygot access to the vast majority of the reallyclever things in the first Virtex family—including the first commercially availabletrue partial-reconfiguration capability.Figure 1 illustrates this concept of partialreconfiguration.

With the never-ending pace of improve-ments in FPGA devices and tools, the firstVirtex family soon gave way to Virtex-2and then II-Pro, 4, 5, 6 and, now, Virtex-7.

stumbling block to broader acceptance hasbeen the lack of a well-defined methodolo-gy and supporting tools that fit with thedesign style of a significant and substantialfraction of the FPGA design community.

But today, the tools and support arefinally in place and this powerful designapproach is poised for broad uptake. Thebalance of this article will provide a primeron how an average designer can approachpartial reconfiguration (PR) as a uniquesystem capability and be successful at it,without diving as deeply into FPGA anato-my as was previously required. Those unfa-miliar with PR terminology will find acheat sheet on the lingo in Table 1.

Some Definitions and RequirementsWhile implementing PR in a design islargely automated with the latest releases,ISE® 12.1 or newer, of Xilinx design tools,the innovation involved in determiningwhere to use partial reconfiguration is stillvery much a manual process. Only theingenuity of a designer is able (at this pointin tools evolution) to see where to apply PRand how to manage the overall system to

Throughout this evolution, a small contin-gent of dedicated users took whatever toolsthey could find, or in some cases createdtheir own, and figured out very clever sys-tem uses for partial reconfiguration. A key


XPERTS CORNER

Full BitFile

ConfigurationPort

or ICAP

Partial BitFiles

Function C

3

Function B

1

Function A

2

Reconfigurable partition (RP)

Reconfigurable module (RM)

Static logic

Configuration

Partition pins

Proxy logic

Configuration frame

Internal Configuration Access Port (ICAP)

Bottom-up synthesis

Reconfigurable logic

Partition

Design hierarchy instance marked by the user for reconfiguration;a physical area of the FPGA that will be configured

A specific instance of the logical design that occupies the RP

Smallest addressable segment of configuration memory space

Internal connections to the SelectMAP interface for FPGA configuration

Synthesis of individual partitions with no optimizations acrossboundaries, resulting in a separate netlist for each partition

A logical section of a design at a hierarchy boundary

Any logic that is part of a reconfigurable module

LUT1s inserted automatically on each partition pin to anchorconnections between static logic and the RMs

Pins at ports on a partition that interface between staticand reconfigurable logic

A full design consisting of static logic and one RM for each RP

All of the logic in the design that is not reconfigurable

Figure 1 – Partial reconfiguration is the ability to dynamically modify blocks of logic by downloadingpartial bit files while the remaining logic continues to operate without interruption.

Table 1 – Talking the talk: a PR glossary

use it successfully. Some considerations andguidelines include:

• What modules can or should be“swapped” using PR?

• What is the impact of swapping on thestate of the balance of the design andwhat steps are needed to make theswapped logic function properly (isola-tion/decoupling requirements, andhow do I manage those in my system)?

• What is the impact on performance(timing relationships, etc.) and on verification?

• How am I going to manage the PRprocess, including storing and deliver-ing the configuration files? How con-cerned am I about the reliability of thePR process?

• What constraints does PR impose onclocks and clock resources?

Let’s examine each of these questionsbriefly as we prepare to use PR in an actu-al design.

What Modules Can Be Swapped Using PR? The key for using PR is to identify por-tions of the design that do not need to beactive all the time in all uses of the FPGAdesign. For the most part, you can identi-fy these sections simply by looking at a

high-level block diagram or function listand asking a few basic questions:

• Is this piece of functionality depend-ent on some external event, option,user command, etc. that initiates itsuse separate from some other, similarfunction?

• Would the implementation of thisfunction be the result of a decisionsuch as “If X then Y” or “If X then Yelse Z,” or perhaps a case statement?

• Are there large sequential processingfunctions in the design that do notoverlap? (Even in the case of someoverlapping functions, alternatingfunctions vs. purely sequential onesmight be candidates for swapping.)

• Are there parts of the design that arenever going to be operating at thesame time? This is only obvious tosomeone familiar with the overall sys-tem architecture, but is one of themost basic uses of PR dating back tothe beginning of FPGAs.

• Conversely, is there a portion of thelogic that should be locked in placeand never changed or reconfigured?This might serve as a base static-logicset, with all else potentially consideredfor PR.

• Some end equipment constraints mightforce a startup-time constraint on theFPGA. For large FPGAs it might bepossible to configure only the minimalportion required for startup and havethe remainder as one or more reconfig-urable modules (RMs) that get loadedafter startup is complete. PCI Express®

implemented in a large Virtex device isa candidate for PR in this mode ofoperation. (See Xilinx Answer Record35380 for details.)

Using PR in a design should be consid-ered as akin to “hot-swapping” a device orboard in a system. As the reconfigurable par-tition (RP) is reconfigured, routingresources, logic functions, RAM blocks andso forth will be changing state. The designermust consider this and use implementationtechniques that will prevent any spuriousdata within the RM from affecting operationof the static logic while the RP is being con-figured. Best practices include forcing theRP into a reset state prior to starting recon-figuration, and holding it in reset until afterreconfiguration has been completed and suf-ficient time or clock cycles have occurred sothat it will emerge in a known-stable state.Figure 2 illustrates this practice. In addition,we strongly suggest that you place isolationregisters in both the static logic and the RMat the boundaries for all inputs and outputs.


XPERTS CORNER

Static Logic — Operating

RM 1Operating

RM 2Operating

RM 1Reset

RM 2Reset

Assert

Deassert

Reset

Reset

Synchronization Halted

Partial Reconfiguration

Figure 2 – Synchronization best practice


Impact on Verification Clearly, the addition of isolation registersat the boundaries between static logicand RMs may add cycles of latency to theoverall operation of the combination ofthe static logic and each RM that getsswapped into the RP. However, these reg-isters will also make timing closure easier.Also, the additional constraints of put-ting boundaries on where the RM logiccan be located may cause some decreasein the maximum operating speed andachievable density.

Only the designer can make the deci-sion about the impact of these trade-offsto the overall design, or the propriety ofeliminating isolation registers on a case-by-case basis, for example. Because it isjust another FPGA design, you mustmanage the logic inside the RM using thesame techniques to achieve the desiredsystem performance.

You must verify the combination of thestatic logic and each RM as a unit, just asif it were all static logic. Where verificationbecomes more complex and the current

state of tools does not help is in the transi-tion from one RM to another, and theoverall system operation during thatprocess. At this time, there is no means tosimulate the FPGA during the reconfigura-tion process. Therefore, you can only doverification up to the point where a recon-figuration is initiated, and can only startfrom the point where reconfiguration hasbeen completed going forward.

You can emulate the process of partialreconfiguration by using a code wrapper inyour simulation to swap out which RM isbeing used, with the appropriate timepause for the reconfiguration to completeand the new RM to become active. Thiswill be much easier and yield more realisticresults if you follow the reset and isolationregister guidelines.

Reconfiguration is a subset of completedevice configuration and uses essentiallythe same hardware and techniques. Whilethe tools flow will properly produce theappropriate configuration files, the design-er must figure out what method of pro-gramming is appropriate for the system

being designed, and what precautions orsafety measures are required based on thenature of the end equipment. The method-ology section later in this article discussesthis issue further, but the designer mustconsider it when contemplating use of PRin the overall system design. Biometrics,software-defined radio, cryptography,high-performance computing and net-working are among the system applicationsbeginning to use PR today.

Methodology and ToolsStandard tools for Xilinx FPGA designshave advanced to the point of being com-pletely applicable to PR designs. You willneed a separate license from Xilinx toenable the various aspects of the tools thatare specific to PR, so talk to your Xilinxrepresentative if you want to try them out.The design flow is built around thePlanAhead™ tool. If you are not familiarwith PlanAhead, now is the time to sign upfor a class to become expert with it.

You implement a design that will takeadvantage of PR just like a regular design,

XPERTS CORNER

MGT

MGT

EMAC

EMAC

FSL

FSL

Virtex-4 FPGA ICAP

RPPLB

Network

PCIController

DDR2Controller

DDR2 Memory(SO DIMM)

Mini-SDCard

FTLPacket

ProcessingEngine

• PPC• PLB BRAM• PLB20PB• FTL Board Support• SRAM • SRAM cntlr• MiniSD cntlr• Console RS 232• etc...

EmbeddedLinux

Figure 3 – Design example block diagram

but you must use a “bottom-up” synthesisflow. This simply means that no optimiza-tions will be done across the boundariesbetween the static logic and the RMs thatwill be swapped. To accomplish this, eachof the partitions will be a separate synthesisproject with its own separate netlist.

The general steps for a design usingPR are:

1. Set up the design structure, decidingon static vs. reconfigurable logic.Synthesize all netlists, define RMs andcreate partitions. Then, assign RMnetlists to the appropriate RPs.

2. Provide the proper constraints foreach RP based on the assigned RMs.

3. Run the PR-specific design rulechecks (DRCs) that are in PlanAhead.

4. Place and route, remembering thateach configuration is a completedesign (static logic and one netlist foreach RP). Create a “golden reference”for the static logic and iterate to closetiming on all combinations of staticlogic and RMs.

5. Create the bit files.

6. Test the design.

Design ExampleWe will present an actual design to showthe steps involved in implementing PR inthe real world. The design uses a XilinxXCV4FX60 FPGA to process networkpackets that are streaming into it from aGigabit Ethernet interface. Two sets ofoperations are to be performed on thepackets based on input from the user. Incase 1, packets are inspected for Ping pack-ets. When a Ping is received, the systemimmediately returns it to the sender withthe proper MAC and IP address reversaland CRC. We did this design in VHDL.

In case 2, the packets are similarlyreceived and inspected, but here the sys-tem is looking for UDP packets. When itreceives one, it performs MAC and IPaddress reversal and proper port assign-ment before returning it to the senderwith the corrected CRC. We did thisdesign in Impulse C to show that the RM

logic can come from different sources;only the synthesized netlist is important.In both cases, turnaround time fromreceipt to response is critical, so all of thepacket manipulation is done immediatelyafter receipt by the MAC. Figure 3 showsa block diagram of this system.

The hardware for this example uses aVirtex-4-based FPGA accelerator card fromFaster Technology (FTL): the P-6 card,with a XCV4FX60 FPGA. SFP modulesprovide direct access to the FPGA fromGigabit Ethernet connections of eithercopper or fiber-optic media. The P-6 isequipped with a robust embedded Linuxsystem, which greatly simplifies the imple-mentation of this example, as well as pro-viding a means to demonstrate use of theInternal Configuration Access Port (ICAP)block for partial reconfiguration from anembedded processor.

The first step, of course, is the creativeone of deciding what logic will be in thestatic area and what logic will be in theone or more modules that will be swappedinto or out of the static design. Once youhave settled on that basic structure, thenyou can use the design flow to implementthose decisions.

For our example design, the embed-ded Linux system with the PCI interface,the basic Gigabit Ethernet MACs andthe ICAP interface will be in the staticlogic. Each of the packet-processingengine implementations will be in a sep-arate RM, since they are never used atthe same time. While this is a simpleexample of mutually time-exclusive func-tions, it is readily extensible to a widevariety of system uses.

The netlist structure and the physicalpartitions on the device mirror eachother. Because the netlist will determinethe size of RP needed, it is generally bestto synthesize all of the modules that areRM candidates to determine the logicresources needed before deciding on thepartition sizes and shapes.

In the example design, the P-6 cardsupplies the elements of the embeddedLinux. The design uses a number of spe-cial IP blocks supplied by FTL, includ-ing a PCI interface, a dual GigabitEthernet MAC wrapper around theXilinx hard macros and some special IPblocks to simplify the integration of userIP into the overall Linux system. Becausethe PR flow deals with the netlists, the


XPERTS CORNER

Figure 4 – EDK view of the design example


blocks in the system can come from vari-ous sources and flows.

We created the simple Ping-return RMin VHDL as a separate ISE project. Wedeveloped the more complex packet-pro-cessing block as a module in C, using theImpulse C tools from ImpulseAccelerated Technology. Figure 4 showsthe EDK view of the “golden” implemen-tation with the UDP response design inplace. The FSL bus connects the RM tothe static logic in this design.

Once you have synthesized all thenetlists, create a project withinPlanAhead and add the netlists to it; PRdesigns are based on the netlist, not theHDL code. Create the partitions that willcontain each RM or group of RMs byselecting the netlist from the project tree(right click on it, and select “SetPartition”). This will bring up a wizard towalk you through the process. One of theoptions will be to make the partitionreconfigurable. Partition boundaries arenot required to align with reconfigurableframe boundaries, but you will achievethe most efficient design results whenthey are aligned. Frame boundaries andthe resources in each frame are differentfor different FPGA families, so be sure tocheck the documentation for the familyyou are using.

Since clocking resources cannot bepartially reconfigured, all global clocknets that an RM might use must be rout-ed to the RP that the RM will occupy.Routing congestion can result if manyclocks are used in multiple clock regions.You can minimize this effect by reducingthe number of clocks that an RP requiresand the number of clock regions it spans.In general, short and wide regions are bet-ter than tall and narrow ones.

You may also use regional clocks. Theymay cross the partition boundary in somegenerations of FPGAs, but at the cost ofusing general-purpose routing instead ofthe dedicated clock resources. You may useBUFIOs and I/O clocks in PR partitions,but they may not cross partition bound-aries. Figure 5 shows the RP (the white-bounded area) that will contain either ofthe two RMs in our example design.

XPERTS CORNER

Figure 7 – Design implementation with UDP response

Figure 6 – Design implementation with black-box RP

Figure 5 – Reconfigurable partition defined

Assign each RM to the RP that it willoccupy by selecting the partition in thenetlist tree, right clicking on it and select-ing “Assign Reconfigurable Module.” Thiswill allow you to choose the netlists of theRMs that will occupy that RP.

Constraints, Placing and Routing Constraints for the static-logic portion ofthe design are propagated down to each ofthe RMs. In most instances the top-levelUCF will be sufficient. If one or moreRMs require constraints beyond that ofthe static-logic constraints, you can addthem in this step. For example, it may benecessary to assign some specific compo-nent placement constraints to meet criticaltiming requirements. In general, the con-straints inherited from the top level will besufficient. Our example design required noadditional constraints within the twoRMs. If necessary, use the tpsync constraintto help constrain the paths that cross thereconfiguration boundary.

For designs that utilize PR, use addi-tional DRCs beyond those required for aconventional design to check variousaspects of the PR implementation. You caninitiate these DRCs from the tools menu.

Perform an initial implementation runto get a “golden reference” for the staticpart of the design. This will include thestatic routes that cross reconfigurable par-titions, as well as the proxy pins that areinside the reconfigurable partitions. Sinceall of the reconfigurable modules that maybe loaded into an RP must contain thesestatic elements, if you reimplement thestatic part of the design, you must alsoreimplement all of the RMs. Perform mul-tiple implementation runs to get the rest ofthe RMs implemented.

To provide a test case against which tocompare the two “real” designs, we did thefirst implementation of our example designwith a “black box” for the packet-processingengine. Note that a black box for an RMwill create a reset region of the FPGA; that’sa handy way to reduce the power of a devicewhen a function is not needed. Figure 6shows the black-box configuration, whileFigure 7 shows the same static logic withthe UDP packet response logic in place.

As each configuration is implemented,the place-and-route tools strive to meet thetiming constraints. After each place-and-route run, timing analysis is automaticallyrun on that configuration. Unless there arecombinatorial paths connecting two RPs,nothing special should be required. If thereare such combinatorial paths, use thetpsync constraint to break the path intotwo independent paths.

Create the Bit FilesBit files are created for each combinationof static logic and RMs from the imple-mentation step just completed. Once youhave a successful implementation run forenough configurations to cover all of theRMs, and have promoted the results, youcan highlight one or more configurationsin the design runs window, right click andselect “Generate Bitstream.” This will gen-erate a full bitstream for each selected con-figuration and partial bitstreams for eachRM in those configurations. No specialoptions are required.

Once the design has been completed,you need to load it into the target FPGA.For the power-on configuration, load thefull bit file including all of the static logicand the first RM in each RP. In somedesigns, the static logic may be loaded

without any logic in any or all of the RPs.This is the “trick” used in the PCI Expresscase to reduce configuration load time.The overall configuration process is short-ened when subsequent PR configurationload cycles are initiated.

In a partial-reconfiguration process, theFPGA already has an active configurationloaded. The DONE pin is not deassertedand no configuration memory is cleared, asthis could cause glitches in logic or routesthat are not being reconfigured. The FPGAis reconfigured one configuration frame ata time. Each configuration frame covers anarea that is one clock region tall by onedesign element wide. The design elementsare CLBs, BRAMs, DSP blocks and thelike. As each configuration frame is loaded,it becomes active.

Delivery of the partial bit files usesstandard interfaces: SelectMap, serial orJTAG configuration ports or the ICAPport. All FPGA bitstreams have CRC tocheck their validity. In a full configura-tion, CRC is checked before the device isallowed to become active to prevent thepossibility of damage from internal con-tention. In a partial configuration, thebitstream is being loaded into an alreadyactive FPGA. Also, the logic becomesactive immediately when a frame is


XPERTS CORNER

Figure 8 – ChipScope view of packet processing


loaded, but the CRC check is not doneuntil the full reconfiguration has beencompleted. Therefore, if it fails the CRCcheck after being loaded, there is a smallchance of damaging the FPGA if you failto take corrective action quickly.

There are several options. You canreload the same bitstream in case it wasjust a temporary error or load a differentknown-good bitstream, such as an emptyblack box. Alternatively, reconfigure theentire FPGA or check the integrity of thebitstream inside the FPGA before it isloaded via the ICAP interface.

For that fourth option, Xilinx has a ref-erence design that will add extra CRCs tothe bitstream for each configurationframe. It then checks the CRC of each

frame before it is loaded into the ICAP. Inthis way, any CRC error is caught beforethe frame is used. (See Xilinx AnswerRecord 35381 for more information.)

For the example design, the P-6 cardcontains a Spartan® FPGA that reads theinitial configuration from a MiniSD cardand programs the main V4FX60 FPGA.Once the FPGA is operating, the embed-ded Linux system in the Virtex-4 FPGAswaps in the selected RM for the packet-processing function the user desires.

We used the ChipScope™ tool to cap-ture operations within the logic of theFPGA to show the operation of the differentpacket-processing engines. Figure 8 showsthe Virtex-4 receiving a UDP packet, pro-cessing it and sending it back to the source.

Straightforward Flow Partial reconfiguration, or the notionthat a system can modify itself duringoperation, is no longer just for explor-ers on the extreme edge of design. ThePR design flow is straightforwardenough so that mainstream designteams can use it, reaping the attendantadvantages in system flexibility andadaptability, power savings and poten-tial system cost savings.

The design example discussed hereshould encourage everyone to consider PRfor their next design. Going forward,Faster Technology will offer a series of Webdemonstrations and live examples that youcan access over the Internet. Look for theseat www.fastertechnology.com.

XPERTS CORNER

by Luc LangloisGlobal Technical Marketing, [email protected]

Recent advances in high-speed data con-verters are expanding the boundaries ofperformance in a vast range of systems,including communications, medical andaerospace. As these trends intensify, the artof combining the high-speed analog signalchain with high-performance digital signalprocessing grows increasingly challenging.FPGAs have emerged as the solution ofchoice to meet these challenges, providinghigh-speed interfaces to the latest-genera-tion data converters and digital signal pro-cessing at high sampling rates.

Selected concepts of multirate digitalsignal processing form the basis for effi-cient high-speed data converter interfacesin Xilinx® Virtex®-6 and Spartan®-6FPGAs. By combining certain features ofthe FPGA architecture with parallel-pro-cessing techniques, your next design canattain performance beyond that of theFPGA fabric.


Multirate Digital Signal Processing for High-Speed Data ConvertersMultirate Digital Signal Processing for High-Speed Data ConvertersBy combining certain features of the FPGA architecture with parallel-processing techniques, your next design can attain performance beyond that of the FPGA fabric.

By combining certain features of the FPGA architecture with parallel-processing techniques, your next design can attain performance beyond that of the FPGA fabric.

XP LANAT ION:FPGA 101

Multirate Digital Signal Processing In many systems, it is desirable to use a fast ana-log-to-digital converter (ADC) to oversample ata rate fs (Megasamples per second, or MSPS)that is beyond the minimum sampling ratedefined by Nyquist, or twice the signal band-width fb. For example, a digital receiver for LTE(Long Term Evolution) wireless communica-tions may sample the analog signal at 122.88MSPS, while the baseband sampling rate maybe as low as 7.68 MSPS. Oversampling spreadsthe quantization noise inherent in the sam-pling process evenly across a wider bandwidth,leaving less noise power in the signal band-width of interest (Figure 1). Subsequent atten-uation of the out-of-band noise through adigital filter yields an improved signal-to-noiseratio compared with a critically sampled signal.

Once the signal has been oversampled bythe ADC and captured in the FPGA, it entersthe digital domain. At this point there is everymotivation to reduce the sampling rate prior toextracting the information content of the sig-nal. Reducing the sampling rate of a digital sig-nal is called decimation, and its benefitsinclude lower FPGA resource usage, lowercomputation rate, lower power consumptionand easier timing closure of the design.However, decimation alone will simply undothe benefits of oversampling by aliasing thequantization noise back into the signal ofinterest, thereby degrading the SNR anew. Tofind an efficient way to preserve the processinggain from oversampling, let us first review thebasic theory of decimation.

Decimation TheoryThe relationship between the input and out-put spectra of a discrete-time decimated signalwithout prior filtering is expressed inEquation 1 and shown in Figures 2 and 3.

Notice how decimation has the effect ofoverlaying D spectral replicas of the originalspectrum, each stretched by a factor D andshifted at intervals of 2π rad/sample. This cancause irreparable distortion due to aliasingunless the decimator is preceded by a low-pass filter to attenuate any spectral contentbeyond π/D (rad/sample).

To preserve the SNR benefits of oversam-pling, you must combine decimation with dig-ital filtering to attenuate out-of-band noise, acombination known as a decimation filter.


fB fs2

Signal

Quantization Noisefrequency

FPGA

ADCx(n)

DDsfsf

YD(n)

AliasingDecimator D = 2

–2π 2π–π π0

–2π 2π–π π0

Figure 1 – Band-limited signal, 4x oversampled

Figure 2 – Decimation without prior filtering

Figure 3 – Aliasing due to decimation without prior filtering


Equation 1

Polyphase Filters While the simple series cascade of filter anddecimator of Figure 4 is functionally correct,there exists a more efficient decimation filtertopology based on polyphase decomposition(as P.P. Vaidyanathan points out in MultirateSystems and Filter Banks, Prentice Hall, 1993).

A polyphase decimator comprises D sub-filters Hd(z–1), each of which contains a subsetof the total set of coefficients (Figure 5).Compared with a simple series-cascaded filterand decimator, the polyphase decimator is aparallel processor with the advantage that eachsubfilter operates at the slow sample rate, eas-ing timing closure of the design.

The combination of delay of increasingorder followed by decimation that distrib-utes fast incoming samples sequentially toeach subfilter is often depicted as a com-mutator, as shown in Figure 6, for a tradi-tional 4x polyphase decimator. It assumesthat incoming samples from the ADCarrive sequentially at the input of thepolyphase decimator.

Alternatively, you may concatenateincoming samples in groups of length D bya serial-to-parallel converter and presentthem to the input of the polyphase structureat a slower rate in block fashion (Figure 7).Such a mechanism, while functionallyequivalent to the amalgam of delays anddecimators, can greatly enhance the per-formance of polyphase filters in Virtex-6and Spartan-6 FPGAs due to their dedicat-ed serial-to-parallel converters.

Help from ISERDES, DSP48 Slices Xilinx Virtex-6 and Spartan-6 FPGAs containdedicated ISERDES serial-to-parallel convert-ers in each I/O block. The ISERDES avoidsthe additional timing complexities encoun-tered when constructing deserializers in theFPGA fabric. Designed for high-speed source-synchronous applications, the ISERDES canimplement an efficient polyphase decimatorby concatenating incoming samples fromhigh-speed ADCs to support sampling ratesbeyond 1 GSPS. Such high data rates wouldotherwise overwhelm a decimation filterimplemented with fabric-based deserializers.

In addition, the Virtex-6 and Spartan-6FPGAs contain dedicated hardware knownas DSP48E1/A1 slices that are optimized


FPGA

ADCx(n)

DH(z)D

sfsf

YD

Filter Decimator

(n)

4

4

4

4

x(n)

Dsf

sff

yD(n)

H0(z-1)

H1(z-1)

H2(z-1)

H3(z-1)

z-1

z-2

z-3

FPGA

ADC a b c d x(n)

Dsfsf

yD(n)

H0(z-1)

H1(z-1)

H2(z-1)

H3(z-1)

FPGA

ADC a b c d x(n)

sf

a

b

c

d

Serial toParallel

Converter1:4

H0(z-1)

H1(z-1)

H2(z-1)

H3(z-1)

yD(n)

4sf

Figure 4 – Series-cascaded filter and decimator

Figure 5 – 4x polyphase decimator

Figure 6 – Traditional 4x polyphase decimator

Figure 7 – 4x polyphase decimator with serial-to-parallel converter



for high-performance digital signal pro-cessing. Not only do the DSP48 slicesavoid the additional timing complexitiesencountered when implementing digitalsignal-processing functions in FPGA fab-ric, they are also very power-efficient. TheDSP48 slices are the ideal building blocksfor the polyphase subfilters.

The optimal design employs the DSP48slices near their maximum performance rat-ing (DSP48E1 fmax = 600 MHz, 1 speed

grade; DSP48A1 fmax = 390 MHz, 4 speedgrade). For example, the first decimationstage in Virtex-6 for a 1-GSPS ADC mightuse a 2x polyphase decimator in which eachsubfilter operates at 500 MSPS (Figure 8).

The same 1-GSPS ADC could beprocessed through a 4x polyphase decima-tor in Spartan-6, with each subfilter operat-ing at 250 MSPS (Figure 9).

Faster ADCs can accommodate higherdecimation rates, employing increased par-

allelism and slower throughput in each sub-filter. Xilinx FIR Compiler IP can assistdesigners in making the optimal trade-offsfor the polyphase subfilters, providingpushbutton implementation of differinghardware architectures such as overclockedDSP48 slices to process several coefficientsin time-multiplexed fashion.

Polyphase Interpolator There are situations that require an increasein the sampling rate, known as interpola-tion. For example, a cable modem typicallyemploys an FPGA-based digital upconvert-er to interpolate the baseband data up to afaster sampling rate suitable for driving ahigh-speed digital-to-analog converter(DAC). The motivation here stems fromthe fact that the higher the DAC samplingrate, the greater the separation in the fre-quency domain between spectral images atthe output of the DAC. This eases the taskof post-DAC analog filtering, yielding animprovement in signal-to-noise ratio.

As was the case for decimation, an inter-polation filter based on a polyphase structurehas advantages over a simple series-cascadedinterpolator followed by a filter. That’sbecause each subfilter operates at the slowercomputation rate, easing timing closure.When combined with the dedicated parallel-to-serial converter (OSERDES) in eachVirtex-6 or Spartan-6 I/O block, you canconstruct high-performance interpolationfilters with minimal FPGA fabric. Figure 10shows a 4x polyphase interpolator in aSpartan-6 driving a DAC at 1 GSPS.

Ripe for CustomizationWe have shown an efficient design method-ology for high-performance polyphase digitalfilters. Now you can customize the criticalprocessing stages of your high-speed dataconverter designs by harnessing the uniquefeatures of Virtex-6 and Spartan-6 FPGAsfor multirate digital signal processing.

The techniques described in this arti-cle are part of an Avnet Speedway designworkshop titled “FPGA-Based SystemDesign with High-Speed Data Con-verters,” to be held across the globe thisfall. For more information, please go towww.em.avnet.com/adcspeedway.

Virtex-6 FPGA

ADC a ba

b

ISERDES1:4

500 MHz

500 MSPS1 GSPS

DSP48E1 DSP48E1

DSP48E1 DSP48E1

YD (n)

Spartan-6 FPGA

ADC

a

bISERDES1:4

250 MHz

250 MSPS1 GSPS

DSP48A1 DSP48A1

DSP48A1 DSP48A1

c DSP48A1 DSP48A1

d DSP48A1 DSP48A1

a b c d

Spartan-6 FPGA

DACOSERDES

4:1

250 MHz

1 GSPS250 MSPS

DSP48A1 DSP48A1

DSP48A1 DSP48A1

DSP48A1 DSP48A1

DSP48A1 DSP48A1

Figure 8 – 2x polyphase decimator with serial-to-parallel converter in Virtex-6

Figure 9 – 4x polyphase decimator with serial-to-parallel converter in Spartan-6

Figure 10 – 4x polyphase interpolator with parallel-to-serial converter in Spartan-6


by Karsten TrottField Application EngineerXilinx GmbH, Munich, Germany

There are many ways to calculate a sinevalue for single-precision floating-pointnumbers, but one of the most powerful isto add hardware accelerators. That was ourconclusion after exploring a number ofapproaches to providing a good, fast andefficient solution when a customer applica-tion required this type of sine calculation.

To determine which implementationworks best for your application, first start byprofiling your code base to find out whichfunction needs some improvement. Then,because software is easier and quicker tochange than hardware, check to see if soft-

ware changes will deliver the higher speedyou need (sometimes they do). But if youstill require more performance, then thinkabout implementing parts of the algorithmin hardware. With hardware acceleration,you can easily outperform any microcon-troller or DSP on the market.

To get a feel for the process, let’s walkthrough a real-world case of how we han-dled a military application that needed asine calculation for single-precision floatnumbers. For cost/performance reasons,the customer had already chosen aSpartan®-6 FPGA with an embeddedMicroBlaze® that functioned as the mainsystem controller. The software algorithmthat handles the sine calculation shouldrun on that MicroBlaze.

The customer’s algorithm heavily uti-lized floating-point operations. Due tothe complexity of the algorithm, chang-ing over to fixed-point arithmetic was notan option. Also, the customer tried toavoid over- and underrun situations thatcan happen when fixed-point arithmeticis being used.

The customer was aware that theMicroBlaze IP provides two types of float-ing-point units (FPUs), and had alreadychosen the extended (as opposed to thebasic) version to accelerate the algorithm.However, this choice makes it impossibleto use the math emulation libraries that aredelivered together with the EDK as part ofthe GNU tool chain. The software emula-tion routines from the math library are


How to Speed Sine Calculations for Your ProcessorHow to Speed Sine Calculations for Your ProcessorIf software changes don’t deliver the speed you need, it’s easier than you might think to add hardware accelerators to your design.

ASK FAE -X

pretty slow; you should avoid them underall circumstances for performance-criticalparts of an algorithm.

The customer knows also that both ver-sions of the MicroBlaze FPUs can processonly single-precision data, but not double-precision data. The customer’s algorithmexplicitly used float precision data only. Butsometimes implicit conversions are donewhen you start using math functions. Theseconversions can force your algorithm to usedouble-precision data without being noticed.

Step 1: Analyze the ProblemOur customer had already run his algo-rithm, but found out that it was runningslowly on the MicroBlaze processor.Upon profiling the code base, the cus-tomer found that it was the sine calcula-tion that was causing the slowdown. Thenext step was to find out why this was thecase and figure out what to do to speedup the processing.

The first approach involved using thestandard sine function provided by themath library and to run the algorithmcompletely as the customer wrote it, with-out any modifications. The major draw-back is that the math library functions arecreated for double-precision data only.That means the prototype of the sine func-tion looks like this:

double sin(double angle);

But the customer would like to use itthe following way:

float sin_val;

float angle;

...

sin_val = sin(angle);

This is, of course, possible and the Ccompiler automatically adds the requiredconversions for the parameter angle todouble and to convert the result of thefunction call back to float value. Again, themath library functions usually executethese two additional conversion functionsand even the sine calculation.

Remembering that the FPU of theMicroBlaze is a single-precision versiononly leads to the following execution:

sin_val = (float)sin((double)angle);

function call. These are the key parameters,and the customer told us that he sometimeshas to calculate multiple sine values in arow (when, for example, filling small tablesbefore processing them).

A lookup table filled with all the valuesis obviously not possible due to therequired size of that table. The minimumnumber of entries is 360,000 float values(4 bytes per value). The customer waslooking for a fast solution, but it had to besize-efficient too.

Our proposed solution made use of thefollowing equation:

sin(xi) with xi = x + d

results in:sin(x+d) = sin(x)*cos(d) +

cos(x)*sin(d)

where d is a value that is always less than thesmallest possible x value (that is, greaterthan zero).

What’s the benefit of this solution? Weneed less space for the tables, but someadditional calculations. The tables are splitinto four tables at the beginning:

cos(x)

sin(x)

cos(d)

sin(d)

Figures 1 and 2 demonstrate the resolu-tion required for all four tables and howthese values typically look. The tablesshow entries for only 16 values to demon-strate what we had to place into ourlookup tables. We use much more in ourfinal solution.

In fact, we used 1,024 values in eachtable. The smallest value for x results in:360/1024 = 0.3515625 degree. All valuesfor d will be less than or equal to this num-ber. This strategy reduces the memory con-sumption regarding a full-blown lookuptable to 4,096 entries (4 bytes each).

The overall precision for the argumentthat we will reach using this approach is:

360/(1024*1024) = 0,000343 degree

And the precision is pretty good. The cal-culation fully utilizes the MicroBlaze FPU.

The real calculation takes some clockcycles—specifically, two fmul operations

Because of the double-precision declara-tion of the sine function by the mathlibrary, the FPU cannot do the sine calcu-lation. Instead, a pure-software solution isrequired. The drawback of this implemen-tation is that it was quite slow and did notfit the customer’s requirements.

We verified that the calculation of thesine value using double-precision data wasthe reason for the slow execution. First wecreated the assembler code directly fromour executable file by using:

mb-objdump.exe -D executable.elf

>dump.txt

Checking the assembler code, we foundthe following line:

brlid r15,-15832 // 4400d300 <sin>

This is exactly the call to the mathlibrary for double-precision sine calcula-tion. Then we measured a single sine calcu-lation utilizing the math library functions.It takes around 38,700 CPU cycles.

Special single-precision functions areavailable for certain jobs, such as calculat-ing the square root:

float sqrt_f( float h);

The usage of that special functionavoids the conversion from single- to dou-ble-precision data and can make use of theMicroBlaze FPU.

Unfortunately, there is no such func-tion to process the sine calculation on theFPU. At this point, we started developingseveral versions to speed up the calcula-tion of the sine value so as to gain moreperformance.

Step 2: Create a Better Software AlgorithmCreating hardware accelerators usuallytakes some time and debugging effort, sowe tried to avoid that approach in the firstrun. We discussed the performance issuewith our customer to get the key parame-ters for the sine calculation.

The customer algorithm requires a pre-cision of 1/100 degree resolution for theparameter angle of the sine calculation.And the calculated sine-value precisionshould be better than 0.1 percent com-pared with the result of the math library


ASK FAE -X

and one fadd operation. But we neededsome additional calculations. First, wehave to split the argument xi into the twovalues for x and d. Then we have to readout the values from our tables, and finallywe have to calculate the result using ournew algorithm.

When we implemented the algorithmin software and tested it, we got an overallclock cycle count of 6,520 clocks.

To further increase the resolution wecould use the following quadrant relations:

First quadrant:sin(x) = sin(x)

Second quadrant:sin(x) = sin(π - x)

Third quadrant:sin(x) = -sin(π + x)

Fourth quadrant:sin(x) = -sin(2* π - x)

This would increase the overall resolu-tion fourfold while keeping the tables atthe same size. On the other hand, we needsome additional computations to find outwhich quadrant we have to calculate for.There is still some room to improve thealgorithm or to decrease the size of thetables (by a factor of four). We have not yetimplemented that additional step.

Step 3: Optimize the AlgorithmBecause the solution we have reached sofar was still not fast enough for our cus-tomer, we tried to optimize the algorithma little bit more, still keeping everythingon the software side running on theMicroBlaze processor. There is one sim-ple optimization possible, but it costssome precision. That’s why we created asoftware model (running on the PC toget more speed) that runs over all possi-ble values and compares the original dou-ble value from sin() with the sine valuescreated by our software algorithm. Wedecided to run that algorithm on a stan-dard PC because doing the comparisonand calculation on the MicroBlaze takesquite a long time (remember that ourMicroBlaze is running at a much lowerspeed than the PC).

Now we started to optimize the calcula-tion to get the sine value:

sin(x+d) = sin(x)*cos(d) +

cos(x)*sin(d)

Since we used 1,024 values in eachtable, that means that d is always less than360 degrees/1,024 steps. Or:

cos(2* π /1024) = 0.99998

and this is nearly equal to 1.0.

For small values of d, the followingequation is valid:

cos(d) = ~1.0

That simplifies our formula to the follow-ing equation:

sin(x+d) = sin(x) + cos(x)*sin(d)

Using our PC model, we checked theprecision of that new equation before weimplemented it on the MicroBlaze andfound out that the maximum error was


1.5

1

0.5

0

-0.5

-1

-1.5

0 1 2 3 4 5 6 7

sin(x)

cos(x)

1.2

1

0.8

0.6

0.4

0.2

0

0 0.1 0.2 0.3 0.4

sin(x)

cos(x)

Figure 1 – Sine and cosine tables for the x-values, which have the range 0 to 360 degrees

Figure 2 – Sine and cosine tables for the d-values, which have the range 0 to 360/16 degrees

ASK FAE -X


still below the target of our customer.Now we implemented this algorithm on

the MicroBlaze as a software algorithm, stillusing tables with 1,024 entries each. Thenew algorithm requires only three tables,one less than in the previous implementa-tion. This saves memory space—and makestime for the additional calculations.

We measured the algorithm on ourhardware. It required 6,180 cycles for a sin-gle calculation.

Step 4: Further OptimizationsAnother optimization that seemed possiblewas to convert the floating-point value ofthe sine calculation and use an integer argu-ment here. The algorithm we were usingallowed us to create ~1E6 different values(1,024*1,024). An integer argument wassufficient to hold that number.

This optimization allowed us to use amuch simpler calculation for the splitting ofthe xi value into x and d. The splitting is justa simple AND operation plus some shiftingby 10 bits. The upper 10 bits of our parame-ter angle are xi while the lower 10 bits are d.

We again created a software model on thePC, checked it and then implemented it onthe MicroBlaze processor system. This took5,460 cycles for a single sine calculation.

Step 5: Thinking about Hardware ImplementationEven though the speed of the algorithm wasmuch improved compared with the originalcalculation from the math library, the cus-tomer required a much faster implementa-tion. But the last step above showed us aversion that can be easily converted tohardware.

It requires some operations for the split-ting of the xi value. In hardware, however,you can accomplish it by just a wiring ofthe required bits. Then we need threetables; we used inferred ROMs with prede-fined values calculated by our PC modeland then transferred into the VHDL codeof the IP. The IP reads all three tables at thesame time, again saving some computationtime. And finally, we require one float-MUL and one float-ADD operation. Forthis job, we found that the COREGenerator™ module for floating-point

operations works pretty well.We instantiated two of these hardware

modules, a job that required some slicesand some multipliers. Both cores required alatency of four to five cycles to match ourtiming of the design. The latency was notan issue yet and will be discussed in thenext steps.

We implemented the final IP as a FastSimplex Link (FSL) IP for the MicroBlaze.The first estimation on timing showed:

• one clock cycle to transfer the datafrom the MicroBlaze to the FSL bus

• one clock cycle to transfer the datafrom the FSL bus into FSL IP (datawill be immediately read from theBRAMs when the argument of thesinus calculation is read from FSL bus,so this doesn’t require any clock cycles)

• four clock cycles to process the MULoperation (cos(x)*sin(d))

• one clock cycle to store the result ofthat equation in a register

• four clock cycles to process the ADDoperation

• one clock cycle to send the data backto the FSL bus

• one clock cycle for the MicroBlaze toread the data from the FSL IP

Please note that the data for the argu-ment has to be stable over the whole time,assuming you do not use any additionalpipelining (we will discuss this in the nextstep). That means that the MicroBlaze canonly request a single sine calculation andhas to read the value after at least 13 clockcycles before it can ask for another calcula-tion.

Thus we estimated that it would take13 clock cycles to run that implementa-tion. And of course, a few more clockcycles are required for handling the func-tion calls on the software plus some addi-tional operations.

We implemented this IP in less than aday simply by putting some standard

FSL_DataRead ROM

sin(d)1024 x 32

ROMsin(x)

1024 x 32

SRL16

fmul

ROMcos(x)

1024 x 32

fadd

FSL_DataWrite

FSL_WriteFSL_Read

Figure 3 – Accelerator IP without pipelining

ASK FAE -X

clocks together, and measured the algo-rithm in hardware. The whole algorithm(mixed software/hardware) took 360 clockcycles (including all function calls). Thiswas a big step forward, but still not enoughto meet the customer’s needs.

We used one SRL16 to delay the writingof the signal until our accelerator IP wasprocessing all data.

The algorithm was now running in par-allel to our MicroBlaze—but it can onlycalculate one value at a time.

Step 6: Adding Pipelining and Adapting Customer CodeAt this point in the design, we started toadd pipelining to our core. The COREGenerator modules for float-ADD andfloat-MUL are already pipelined, so therewas nothing for us to do here. The first ver-sion required that the argument stay con-stant over time until the calculation is done.

After a new calculation starts (argu-ment data arrived inside the FSL IP), twoBRAMs are immediately read and thefloat-MUL is executed. The result of thatoperation is valid after few clock cycles.

Our argument xi of sin(xi) is a 20-bit-wide integer number, split into x and d.Therefore, we had to delay the MSB part x ofour argument xi a few clock cycles to read thecontent of the BRAM, storing xi, to matchto the result of the MUL operation. We useda few SRL16 elements, 10 in all, for our 10-bit-wide number, eating up to 10 LUTs (butdue to LUT combining in the Spartan-6, itwill require only five LUTs thanks to thatdevice’s wider LUT6 structure). The finaleffort was quite minimal for our implemen-tation. The additional SRL16x10 bit ismarked with the red circle in Figure 4.

Then we changed our FSL bus FIFOsusing the EDK wizards to allow storing ofmultiple values (we decided that storingeight values was sufficient for our purposes,but you can easily add more if needed).

That means that our customer canrequest up to eight values before he even asksfor the first result. This was sufficient for ourcustomer’s current needs, but in case hewants to request more sine values, he can eas-ily expand the FIFO buffers to larger values.

We discussed this new approach withthe customer and found out that it is fur-

ther possible to split the sine calculationinto two parts:

1. Requesting of the sine calculation(fslput operation)

2. Requesting of the result of the sinecalculation (fslget operation)

Because we have a fixed latency in thecalculation, if both of these operations exe-cute immediately one after the other, theMicroBlaze will stall and wait until the FSLIP has processed the request.

If we can split the two operations—which was possible in the customer algo-rithm—then we can further improve theoverall speed of the calculation.

With the additional pipelining, the finalcode for execution on the MicroBlaze lookslike this:

putfsl(arg1,fsl1_id);








...

getfsl(result1,fsl1_id);








This put us into a very good position.The core is fully pipelined and allows thesplitting of the two calls of the sine opera-tion. The latency of the IP core is still there,but not visible anymore. The MicroBlazewill no longer stall and wait for a pendingIP calculation, and this improves the over-all performance.

The customer agreed to change his codeaccordingly, which was a minor effort forhim. By using C macros instead of func-tion calls, we were being able to in-line allthe required calls into the code base.

The final implementation of that algo-rithm took only four clock cycles for a


FSL_DataRead ROM

sin(d)1024 x 32

ROMsin(x)

1024 x 32

SRL16

fmul

20 10 LSB

10 MSB

10 MSB

ROMcos(x)

1024 x 32

SRL16x10

fadd

FSL_DataWrite

FSL_WriteFSL_Read

Figure 4 – Accelerator core with pipelining

ASK FAE -X


single calculation. The whole latency ofthe processing is no longer visible, but ishidden by the splitting of calling andrequesting of the result. And the overallIP required a few additional BRAMs (sixfor our three tables) and a few additionalmultipliers/DSP slices plus some addi-tional slices.

But the results are quite astonishing. OurMicroBlaze now behaves like a super-high-end processor core, but is still running at apretty low frequency (it is now ~9,600 timesfaster than the original sine calculation).

Step 7: Further Optimization?When we reached this level of implemen-tation, our customer was satisfied with theresult and we finished our work on theaccelerator IP. The speed and the precisionwere good enough.

Of course, there is one final optimiza-tion that could be done. The algorithm canbe further refined if we evaluate the sin(d)values for very small values of d:

sin(d) = ~d

where d is less than 2*π/1024 – means lessthan 0,0061359, and the overall error isless than 1E-8 (for tables containing1,024 values).

The final step for our algorithm could be:

sin(x+d) = sin(x) + cos(x) * d

This would introduce only a very smalladditional error. But we could get rid of thethird table.

Of course, we would have to keep thefadd and fmul operators.

There might be other ways to calculatea sine value for float values, but thisapproach shows the power of additionalhardware accelerators. Our experienceshows that you need not be afraid if thealgorithm to be placed into hardware con-tains floating-point operations.

Karsten Trott is a Xilinx FAE in Munich,Germany. He holds a PhD in analog chipdesign and has a strong background in chipdesign and synthesis. You can reach him [email protected].

Figure 5 – The EDK enables a FIFO with a depth of eight for the FSL buses, to improve pipelining.

ASK FAE -X

Over the last year, Xilinx has investedheavily in making the design experience

easier for our customers. Xilinx® has devel-oped a unified silicon platform upon whichArtix™-7, Kintex™-7 and Virtex®-7 devicesare built, allowing for the greatest degree offlexibility and scalability for customers. Xilinxis also working toward a unified designmethodology, integrating the FPGA designflow into one, seamless design flow.

With the recent ISE® Design Suite 12.3release, Xilinx takes a bold leap in this journey.In this version, users will see two significantchanges. First, ISE Design Suite 12.3 willbegin supporting the AMBA® AXI4 StreamProtocol Specification. This will allow design-ers to leverage a wide variety of existing IP intheir FPGA designs, thereby reducing time-to-market. Second, Xilinx has invested heavily inthe PlanAhead™ design cockpit to provide asimpler, more consolidated view into yourdesign. This cockpit lets you manage yourentire design from within PlanAhead.

Enabled by a new and improved graph-ical user interface, PlanAhead now offersan RTL-to-bitstream pushbutton task-based flow with three main steps: synthe-sis, implementation, and program anddebugging. The tool has all the necessaryfeatures to efficiently manage projects byadding HDL file sources, IP cores, con-straints and other resources. You can visu-alize the design at any flow stage and anytime for analysis, including a clear view ofdesign resources and timing closure.

The AXI4 interconnect, meanwhile, is apoint-to-point interface developed toaddress system-on-chip performance chal-lenges. It supports pipelining, multipleclock domains and data upsizing anddownsizing. AXI4 also includes featuressuch as address pipelining, out-of-order

completion and multithreaded transac-tions. All of these features, when takentogether, allow much higher-performancesystems than those built over other busarchitectures. When converted to AXI4,Xilinx’s embedded-platform targeted refer-ence design provides twice the bandwidthof the previous targeted reference design.

Taking full advantage of these changesrequires a bit of a shift in the way usersthink about their design. Xilinx has invest-ed heavily in developing training to helpdesigners do just that.

Aligning Training to the ISE Design Suite 12.3 ReleaseIn step with the focus of the ISE 12.3release, Xilinx customer training is updat-ing the following courses to align withthe further progress of the PlanAheaddesign cockpit and the AXI4 IP prepro-duction release.

• Essential Design with the PlanAheadAnalysis and Design Tool: Learn tomanage design performance, plan anI/O pin layout and implement usingthe PlanAhead software tool.

• Advanced Design with the PlanAheadAnalysis and Design Tool: Advancedcapabilities of PlanAhead software helpyou to increase design performanceand achieve repeatable performance.

• Embedded-Systems Design: We haveupdated the embedded design coursesto include the new AXI interface, whileretaining the option to also utilize thePLB bus. This course teaches experi-enced FPGA designers to developembedded systems using the EmbeddedDevelopment Kit (EDK). The lecturesand labs also cover the features andcapabilities of the Xilinx MicroBlaze®

soft processor and the PowerPC® 440.

• Advanced Features and Techniques of Embedded-Systems Design: Thiscourse provides developers with the necessary training to develop complex

embedded systems and enables themto improve their designs by using thetools available in the EDK.

• Embedded-System Software Design:This course introduces you to soft-ware design and development forXilinx embedded-processor systems.You will learn the basic tool use andconcepts required for the softwarephase of the design cycle, after thehardware design is completed.

Additionally, we have updated the fol-lowing classes.

• Xilinx Partial Reconfiguration Toolsand Techniques: This course demon-strates how to use the ISE, PlanAheadand EDK software tools to construct,implement and download a partiallyreconfigurable (PR) FPGA design.

• Designing a LogiCORE® PCIExpress® System: Updated to utilizethe connectivity targeted referencedesign as the primary lab design, thisclass will give you a working knowl-edge of how to implement a XilinxPCI Express core in your applications.

• Labs targeting Virtex-6 demonstra-tion boards (ML605) have updatedcontent.

• Free recorded E-learning contentnow includes the latest devices andsoftware (available at http://www.xilinx.com/training/ ).

A Global PushEach of these classes is available world-wide today. There are 26 XilinxAuthorized Training Providers operatingin nearly 50 countries (see chart). Spaceis limited, so sign up today. To view thenew, interactive worldwide trainingschedule, go to http://www.xilinx.com/t ra in in g /wo r l dw id e - s c h edu l e . h tm .Otherwise, contact your local ATP formore details: http://www.xilinx.com/training/atp.htm.

Take Advantage of a Simplified Design Experience

ARE YOU XPER IENCED?

by Tom ThomasTraining Program Manager

Rhett WhatcottCustomer Training Development ManagerXilinx, Inc.



AUTHORIZED TRAINING PARTNER CONTACT COUNTRY/REGION(S) SUPPORTED

Xilinx Training Worldwide www.xilinx.com/training Worldwide

AMERICAS [email protected]

Anacom Eletrônica www.anacom.com.br Brazil

Bottom Line Technologies www.bltinc.com Delaware, District of Columbia, Maryland, New Jersey, New York, Eastern Pennsylvania, Virginia

Doulos www.doulos.com/xilinxNC Northern California

Faster Technology www.fastertechnology.com Arkansas, Colorado, Louisiana, Montana, Oklahoma, Texas, Utah, Wyoming

Hardent www.hardent.com Alabama, Connecticut, Eastern Canada, Florida, Georgia, Maine,Massachusetts, Mississippi, New Hampshire, North Carolina, Rhode Island,South Carolina, Tennessee, Vermont

Multi Videa Designs (MVD) www.mvd-fpga.com Argentina, Brazil, Mexico

North Pole Engineering www.npe-inc.com Illinois, Iowa, Kansas, Minnesota, Missouri, Nebraska, North Dakota, SouthDakota, Wisconsin

Technically Speaking www.technically-speaking.com Arizona, British Columbia, Southern California, Idaho, New Mexico, Nevada,Oregon, Washington

Vai Logic www.vailogic.com Indiana, Kentucky, Michigan, Ohio, Western Pennsylvania, West Virginia

EUROPE, MIDDLE EAST & AFRICA (EMEA) [email protected]

Arcobel Embedded Solutions www.arcobel.nl The Netherlands, Belgium, Luxembourg

Bitsim AB www.bitsim.com/education Sweden, Norway, Denmark, Finland, Lithuania, Latvia, Estonia

Doulos www.doulos.com/xilinx United Kingdom, Ireland

Inline Group www.plis.ru Moscow Region

Logtel Computer Communications www.logtel.com Israel, Turkey

Magnetic Digital Systems www.magneticgroup.ru Urals Region

Mindway www.mindway-design.com Italy

Multi Video Designs (MVD) www.mvd-fpga.com France, Spain, Portugal, Switzerland

Pulsar Ltd. pulsar.co.ua/en/index Ukraine

Programmable Logic Competence Center (PLC2) www.plc2.de Germany, Switzerland, Poland, Hungary, Czech Republic, Slovakia, Slovenia,Greece, Cyprus, Turkey, Russia

SO-Logic Consulting www.so-logic.co.at Austria, Brazil, Czech Republic, Hungary, Slovakia, Slovenia

ASIA PACIFIC [email protected]

Active Media Innovation www.activemedia.com.sg Malaysia, Singapore, Thailand

Black Box Consulting www.blackboxconsulting.com.au Australia, New Zealand

E-elements www.e-elements.com China, Hong Kong, Taiwan

Libertron www.libertron.com Korea

OE-Galaxy Co., Ltd. [email protected] Vietnam

Sandeepani Programmable Solutions Pvt. Ltd. www.sandeepani-vlsi.com India

Symmid www.symmid.com Malaysia

WeDu Solution www.wedusolution.com Korea

JAPAN [email protected]

Avnet Japan www.jp.avnet.com Japan

Paltek www.paltek.co.jp Japan

Shinko Shoji Co., Ltd. xilinx.shinko-sj.co.jp Japan

Tokyo Electron Device Ltd. ppg.teldevice.co.jp Japan

ARE YOU XPER IENCED?

New Device SupportISE Design Suite 12.3 adds support for theSpartan-6 Lower Power (-1L) and XASpartan-6 production devices (via the 12.3speed files patch); XA Spartan-6 -3 speedgrade; and defense-grade Virtex®-6QDevices (I-Grade only).

ISE Design Suite: Logic Edition

Ultimate productivity for FPGA logic designLatest version number: 12.3Date of latest release: October 2010Previous release: 12.2URL to download the latest release:www.xilinx.com/download

Revision highlights:The ISE Design Suite Logic Editionincludes several updates, most notably thefinal release of ModelSim Xilinx Edition-III and Xilinx’s emphasis on its ISIMsimulator as its flagship simulation envi-ronment going forward. For more infor-mation, please see change notice XCN10028 at http://www.xilinx.com/support/d o c u m e n t a t i o n / c u s t o m e r _ n o t i c e s /xcn10028.pdf.

Project Navigator: Integration is improvedthanks to support for custom SmartXplorerstrategy files from within Project Navigator,the ability to save only “N” best results inorder to efficiently manage disk space,additional resource-utilization data dis-played in the SmartXplorer Results windowand the ability to abort individualSmartXplorer runs.

In addition, the project archive now hasthe ability to archive sources only, alongwith a new option to exclude generatedfiles when creating a project archive. Thetool includes a new option to create VHDLlibraries based on file directory name andwill automatically add all VHDL files froma selected directory.

PlanAhead: PlanAhead boasts a number ofnew features, the biggest of which are anupdated AXI IP core in the IP Catalog andintegration with CORE Generator™; pin-planning usability improvements, includ-ing table-based constraint editing, packagepin location swaps and improved visibilityof configuration settings; and improvedSSN prediction for Spartan-6 devices.

The tool also includes a number ofproject infrastructure improvements, suchas the ability to launch runs without

unnecessary copying of sources, improve-ments to functionality for when directorytrees are added as sources, manual file-ordering control for synthesis, backups tojournal and log files, passing PCF files tobitgen and copy run features.

Among several GUI improvementsand changes, the 12.3 version hasremoved on-the-fly creation of user-defined attributes and offers additionaloptions for find-in files. Other GUIchanges include the ability to open a proj-ect directly from the Getting Started page;UCF file order display; the ability to opena file browser in a run directory; improvedidentification of read-only files; tool tipsfor missing files; and an option to clear alloutput from the Tcl console view.

This version of PlanAhead includesfloorplanning improvements for swap-ping macro location constraints for BlockRAMs and DSPs, improvements toChipScope™ integration for CDC fileexport and a more consistent debug core-naming convention. It also includes anew tutorial for Tcl and SDC commands.

ChipScope Pro: ISE Design Suite: LogicEdition includes IBERT Support forIBERT CORE Generator and an analyzerIBERT console, both for Virtex-6 GTHrev2.0. An example file for IBERT COREGenerator GUI for the Virtex-6 GTX showsyou how to include IBERT core logic inyour designs. Also, an AXI monitor in XPSsupports AXI4-based designs for Spartan-6and Virtex-6 families, along with AXI4-Memory Map and AXI4-Lite interfaces.

Implementation (place and route): The ISEDesign Suite implementation tools nowsupport intelligent clock gating forSpartan-6 FPGAs (including BRAMoptimizations).


Xilinx Tool & IP UpdatesXTRA, XTRA

In early October, Xilinx made available the ISE® Design Suite 12.3. Thisnew version features the rollout of several intellectual-property (IP) coresthat meet the AMBA® AXI4 specification for interconnecting functionalblocks in system-on-chip (SoC) design. It also introduces several productivityenhancements to the PlanAhead™ tool, most notably a new RTL-to-bit-stream design flow that has a new and improved user interface and projectmanagement capabilities. ISE Design Suite 12.3 also includes intelligentclock-gating support for reducing dynamic power consumption inSpartan®-6 FPGA designs. The download is available at http://www.xil-inx.com/support/download/index.htm.

ISE Design Suite: DSP Edition

Flows and IP tailored to the needs of algo-rithm, system and hardware developers Latest version number: 12.3Date of latest release: October 2010Previous release: 12.2URL to download the latest release:www.xilinx.com/download

Revision highlights:All ISE Design Suite Editions includethe enhancements listed above for theISE Design Suite: Logic Edition. Specificto the DSP Edition, ISE 12.3 introducesseveral enhancements to both SystemGenerator for DSP and DSP IP.

AXI4 support: System Generator andCORE Generator 12.3 contain newAXI4 IP, specifically the complex mul-tiplier 4.0, DDS Compiler 5.0, FFT8.0 and FIR Compiler 6.0. The DSPEdition also includes beta support forAXI Pcore generation. That is, busabstraction will make it possible to portexisting designs over to AXI by justselecting a radio button on the EDKblock in System Generator. You canimport and co-simulate AXI-basedEDK designs using hardware co-simu-

lation. The DSP Edition also includesSDK co-debug of AXI systems. In par-ticular, it offers software debug withSDK of AXI systems, Pcore design anddebug in System Generator and anupdated tutorial for the AXI FIR andAXI Pcore.

Run-time improvements: The tool pro-vides a 2x to 3x decrease in run-timefor the entire flow, from design loadingto netlisting, in System Generator 12.3.It also includes improvements in theCompile (Ctrl-D), Initialization,Simulation and Netlisting steps.

XPS Ports View can now group ports aspart of a bus or an I/O interface. It nowincludes the ability to easily connect ports of aninterface to a module outside the EDK subsys-tem. Users can rearrange the order in which IPblocks are shown in the System Assembly Viewand customize the view to suit their design.

SDK enhancements: The 12.3 version of SDKsupports AXI-based designs. MicroBlazev8.00a includes little-endian support (AXIonly) and a GNU tool chain update, includ-ing a new –m little-endian switch.MicroBlaze also includes make-file genera-tion and software tool updates, includingSDK, XMD, Libgen and FlashWriter.

System Generator co-debug is now enabledfor AXI IP. The SDK also supports drivers andBSP for AXI IP, along with sample applicationgeneration and Xilkernel support. An AXIinterface is available for MDM v2.00a.

SDK 12.3 has several interface updates. Inthis version, relative paths for repositories areallowed if the repository is at the same direc-tory level as the BSP directory or one levelabove. Repository paths are refreshed whenimporting BSP projects. Manual modifica-tion of the BSP settings file (MSS) now trig-gers a rebuild. Restoring defaults in C/C++build settings no longer removes inferredoptions. Default suggested Windows work-space location has been changed.

A Hello World SDK sample applicationsupports MDM UART, while a screencastdemonstrating how to use SDK is availablefrom the SDK Welcome page. Several bugpatches are provided with 12.3 AnswerRecord 36777 at http://www.xilinx.com/support/answers/36777.htm.

Xilinx CORE Generator & IP UpdatesName of IP: ISE IP Update 12.3Type of IP: All

AXI support:With the release of ISE Design Suite 12.3,Xilinx is introducing CORE Generator IP with

ISE Design Suite: Embedded Edition

An integrated software solution for designing embedded processing systemsLatest version number: 12.3Date of latest release: October 2010Previous release: 12.2URL to download the latest patch:www.xilinx.com/download

Revision highlights:All ISE Design Suite Editions include theenhancements listed above for the ISEDesign Suite: Logic Edition. The following isthe list of enhancements specific to theEmbedded Edition.

XPS enhancements: With ISE 12.3, XPSincludes support for AXI-based designs. Toaccomplish this, Xilinx has added Base SystemBuilder support for single-MicroBlaze® AXI-based designs along with the ability to stitchtogether a system with AXI-based IP inter-faces. Each AXI master or slave can run at adifferent frequency; XPS automatically han-dles any clock conversion logic. The XPSSystem Assembly View connectivity panel pro-vides a mechanism to capture sparse connec-tivity between various AXI masters and slavesfor AXI interconnect IP. The AXI InterconnectIP Configuration GUI includes a dialog tochange master- or slave-specific settings. In theSystem Assembly View, it will automaticallyadd a slave module at the time of instantiation.

Support for embedded-software develop-ment is not available in XPS for AXI-baseddesigns. The Xilinx SDK remains the primaryembedded-software development tool. XPScontinues to support software development forPLB designs in a deprecated mode.

XPS includes streamlined integration ofMIG flows for AXI-based DDR interfaces,along with AXI ChipScope Monitor supportand the ability to cascade multiple AXI inter-connect IP blocks in a system, as well as theability to easily connect to an AXI module thatis outside the EDK subsystem.


XTRA, XTRA

the AXI4 interface. Xilinx has updated thelatest versions of the CORE Generator IPlisted below with various forms of AXI4interface support.

In general, the latest version of a givenpiece of IP for Virtex-6 and Spartan-6 devicefamilies will support the AXI4 interface.Older “production” versions of IP will contin-ue to support the legacy interface for therespective core on Virtex-6, Spartan-6, Virtex-5, Virtex-4 and Spartan-3 device families.

For general information on Xilinx AXI4support, see www.xilinx.com/axi4.htm. Formore information on Xilinx AXI4 IP support,see www.xilinx.com/ipcenter/axi4_ip.htm.

Connectivity IP: Aurora 8b/10b v6.1 is nowavailable with the AXI4-Stream interface.Performance capabilities are retained fromthe LocalLink version with minimal coresize increases. CAN v4.1 and DisplayPortv2.1 now come with an AXI4-Lite interface.The PCI Express® core retains the perform-ance capabilities of the TRN version with aminimal increase in core size. The Spartan-6and Virtex-6 integrated block for PCIExpress v2.1 is now available with the AXI4-Stream interface.

DSP IP: FIR Compiler v6.0, DDS Compilerv5.0, Complex Multiplier v4.0 and FastFourier Transform (FFT) v8.0 are now avail-

able with the AXI4-Stream interface. Therelease provides a demonstration VHDLtestbench for the selected CORE Generatorconfiguration and supports the upgradefrom the previous version with a legacyinterface. (Please note that this upgrade willonly affect core parameters. You must adaptthe core instantiation in the design to usethe AXI4-Stream interface. For furtherinformation, please refer to the section onmigrating from earlier versions in therespective IP data sheet.

Memory and storage elements: FIFOGenerator v7.2 is now available with AXI4,AXI4-Lite and AXI4-Stream and nativeinterfaces. The FIFO Generator automati-

cally uses write-first mode for Spartan-6SDP BRAM-based FIFO configurationswhen different clocks are used for reducedBRAM utilization. MIG v3.6 is now avail-able with AXI4 and native interfaces.

Embedded processor IP: Platform Studio nowincludes more than 30 new AXI4-compli-ant cores, such as MicroBlaze version 8.00asupporting both PLBv46 and AXI inter-faces, AXI Ethernet and AXI Ethernet Lite,AXI USB 2.0 Device, AXI-based DMAsand AXI Serial Peripheral Interface.

Wireless IP: The new 3GPP LTE ChannelEstimator v1.0 is available with AXI4-Stream interface. This latest component ofXilinx’s LTE Baseband Targeted DesignPlatform provides an optimized channel-estimation function for the physical-uplink shared channel (PUSCH) in LTEbase stations. Antenna configurations upto and including 2 x 2 multiuser MIMOare supported. You can tailor this config-urable function to meet the unique needsof your application.

Additional IP Highlights

• Digital predistortion (DPD) v3.0 getsits first release in CORE Generator.This market-leading DPD solution forcommon wireless standards reduces basestation equipment capital and operatingexpenditures by increasing poweramplifier efficiency. Configurable tosupport different algorithmic perform-ance and area combinations, it allowscustomers to uniquely tune the coreimplementation to their needs.Supporting up to eight antennas, itoffers the smallest, lowest-power andlowest-cost FPGA-based DPD solutionavailable in the market today.

• Soft-error mitigation (SEM) v1.1 auto-matically detects and corrects soft errorscaused by ionizing radiation to enable thehighest system reliability and availabilityand reduce system costs. You can config-

ure this feature to support multipleerror-correction methods. Optional errorclassification and error injection areincluded. The IP, which is delivered inCORE Generator, supports the XilinxVirtex-6 FPGA family.

• Block Memory Generator v4.3 has newwrite-first mode support for single dual-port (SDP) memory type for Spartan-6devices. Use the write-first mode insteadof read first for SDP BRAM when theread and write ports are clocked by dif-ferent clocks to avoid address space over-lap issues. The core also supports softHamming error correction in SDPBRAM configurations for data widths <64 bits (Virtex-6, Virtex-5 and Spartan-6devices only).

You will find a comprehensive listing ofcores that have been updated in this release atwww.xilinx.com/ipcenter/coregen/12_3_datasheets.htm. For more information, see www.xil-inx.com/ipcenter/coregen/updates_12_3.htm.

Additional CORE Generator EnhancementsThe CORE Generator catalog now includesa new “AXI4” column highlighting IP coreswith AXI4 support. The IP informationpanel displays information on supportedAXI4 and native interfaces.

You can expand IP symbols to display thecomponents and their bit allocation on a sin-gle Tdata channel. For example, you canexpand a complex multiplier’s s_axis_a_tdata[31:0] pin symbol to show the packing ofreal[15:0] and imaginary[31:16] components.

You can now access supplementary IPdocuments such as user guides from the pull-down menu in the IP GUI as well as fromthe IP information panel.

Additional PlanAhead IP Design Flow EnhancementsA new “AXI4” column in the IP cataloghighlights IP cores with AXI4 support,while the IP information panel displaysinformation on supported AXI4 and nativeinterfaces.

XTRA, XTRA


WITH ELECTRICITY BILLSaccounting for a large part of operating

expenses (OPEX), and OPEX itself

accounting for ~70% of total costs,

operators have had to pay attention

to power consumption. Traditionally,

silicon suppliers have looked at transistor

and process technology to find ways

to lower power consumption. While a

major contributor, transistors are not the

only factor and can only get you so far.

Power reduction is better achieved

with a more holistic system-level

approach. The best results are achieved

by taking into consideration silicon

process technology, leveraging

power-aware tools, designing code

for low power, adjusting system-level

architectures, and applying algorithms

that can significantly reduce system-level

power (e.g., digital pre-distortion [DPD]

in a remote radio head application).

Choosing the right silicon partner

can help. Xilinx approaches power

management as just described —

holistically — rather than just focusing

myopically on transistors and process

node technology. Xilinx® FPGA platform

solutions help designers adopt power-

optimizing design approaches and system-

level design and integration techniques

that broadly address the power problem.

At the design level, Xilinx power-aware

tools, as well as an extensive library

of power-efficient reference designs

and application notes help engineers

optimize overall power consumption.

Dedicated teams of Xilinx application

engineers can also help designers meet

stringent power goals. Xilinx engineers

can walk customers through design

optimization techniques such as folding

a DSP-intensive design to reduce the size

of the design, therefore lowering static

power consumption and cost by using a

smaller device.

At the system level, Xilinx’s attention

to integration has yielded great results.

For example, integrating multiple discrete

components into a single FPGA greatly

reduces the total amount of system I/O,

which can account for a significant

amount of the power draw. Furthermore,

using advanced algorithms like DPD in

a remote radio head allows telecom

equipment manufacturers (TEM) to use

a lower-power, lower-cost power

amplifier — having the most significant

impact on system-level power

consumption.

Certainly Xilinx recognizes that

transistor and process node technology

cannot be completely ignored when

reducing power. The 28nm Xilinx 7-Series

FPGAs enable a 50% overall power

reduction compared to the previous

40nm generation. In terms of transistor

technology, Xilinx’s low-power process

and use of multiple transistor sizes

minimize static power. Xilinx FPGAs

use hard blocks for DSP, memory, and

SERDES, which greatly minimizes dynamic

power consumption over comparable

DSP and other FPGA designs.

Addressing the power challenge at

the transistor level will get you started

in the quest to lower power and lower

OPEX. However, only by fine-tuning

all levels holistically will you see the

greatest results.

To learn more about Xilinx’s

affordable, power-optimized wireless

solutions, please visit www.xilinx.com/

esp/wireless.

A D V E R T I S E M E N T

By Manuel Uhm

Meeting the Power Challenge: Transistors Only Take You So Far

~45% reduction

in cost

Designs based on Xilinx FPGAs can take advantage of industry-leading functional density and advanced radio algorithms, such as DPD, to minimize external circuitry and reduce the power consumption of the power amplifier, thereby minimizing power for the complete system.

About the Author: Manuel Uhm is the Director of Wireless Communications at Xilinx Inc. (San Jose, Calif.) and Chair of the User Requirements Committee, Wireless Innovation Forum (SDR Forum v2.0). Contact him at [email protected]

© Copyright 2010 Xilinx, Inc. Xilinx, the Xilinx logo, Virtex, Spartan, ISE, and other designated brands included herein are trademarks of Xilinx in the United States and other countries. All other trademarks are the property of their respective owners.


Xpress Yourself in Our Caption Contest

XCLAMAT IONS!

NO PURCHASE NECESSARY. You must be 18 or older and a resident of the fifty United States, the District of Columbia, or Canada (excluding Quebec) to enter. Entries must be entirely original and must be received by 5:00 pm Pacific Time (PT) on December 31, 2010. Official rules available online at www.xilinx.com/xcellcontest. Sponsored by Xilinx, Inc. 2100 Logic Drive, San Jose, CA 95124.

Let’s talk turkey. If you have a yen to Xercise your funny bone, here’s your opportunity. In the run-up to the holidays, we invite youto cook up an engineering- or technology-related caption for this cartoon depicting some unusual goings-on in a corporate

conference room. The image might inspire a caption like “Does anyone know the mathematical formula for gravy?”

Send your entries to [email protected]. Include your name, job title, company affiliation and location, and indicate that you have readthe contest rules at www.xilinx.com/xcellcontest and agree to them. After due deliberation, we will print the submissions we like the bestin the next issue of Xcell Journal and award the winner the Xilinx® SP601 Evaluation Kit, our entry-level development environmentfor evaluating the Spartan®-6 family of FPGAs (approximate retail value, $295; see http://www.xilinx.com/sp601). Runners-up will gainnotoriety, fame and a cool, Xilinx-branded gift from our SWAG closet.

The deadline for submitting entries is 5:00 pm Pacific Time (PT) on December 31, 2010. So, get writing!

DA

NIE

LG

UID

ER

A

The Best Prototyping System Just Got Better

� Highest Performance

� Integrated Software Flow

� Pre-tested DesignWare® IP

� Advanced Verification Functionality

� High-speed TDM

HAPS™ High-performance ASIC Prototyping Systems™

For more information visit www.synopsys.com/fpga-based-prototyping

HALF THEPOWER

TWICE THE PERFORMANCEA WHOLE NEW WAY OF THINKING.

Powerful, flexible, and built on the only unified architecture to span low-cost to ultra high-end FPGA families. Leveraging next-generation ISE Design Suite, development times speed up, while protecting your IP investment. Innovate without compromise.

LEARN MORE AT WWW.XILINX.COM / 7

Lowest power and cost

Best price and performance

Highest system performance and capacity

Introducing the 7 Series. Highest performance,

lowest power family of FPGAs.

© Copyright 2010. Xilinx, Inc. XILINX, the Xilinx logo, Artix, ISE, Kintex, Spartan, Virtex, and other designated brands included herein are trademarks of Xilinx in the United States and other countries. All other trademarks are the property of their respective owners.

PN 2463

Documents

Xcell Journal Issue 73