Xcell journal, Issue 64, Second Quarter 2008. SOLUTIONS FOR A PROGRAMMABLE WORLD. COVER: Embedded Processing Innovations with Virtex-5 FXT Devices. INSIDE: Next-Generation Wireless Standards Using Virtex-5 FXT Devices; FPGA Coprocessors for Spacecraft Image Processing; Designing Multiprocessor SOCs; Building Linux Platforms on Xilinx Processors. www.xilinx.com/xcell/ Inside the New Virtex-5 FXT FPGA. EMBEDDED PROCESSING SOLUTIONS EDITION

EMBEDDED PROCESSING SOLUTIONS EDITION Xcell


Page 1: EMBEDDED PROCESSING SOLUTIONS EDITION Xcell

Xcell journal
SOLUTIONS FOR A PROGRAMMABLE WORLD

Issue 64, Second Quarter 2008

COVER
Embedded Processing Innovations with Virtex-5 FXT Devices

INSIDE

Next-Generation Wireless Standards Using Virtex-5 FXT Devices

FPGA Coprocessors for Spacecraft Image Processing

Designing Multiprocessor SOCs

Building Linux Platforms on Xilinx Processors

www.xilinx.com/xcell/

Inside the New Virtex-5 FXT FPGA

EMBEDDED PROCESSING SOLUTIONS EDITION
ISSUE 64, SECOND QUARTER 2008
XCELL JOURNAL

XILINX, INC.

1 800 332 8638
www.em.avnet.com

©Avnet, Inc. 2008. All rights reserved. AVNET is a registered trademark of Avnet, Inc. ©2008 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.

Bringing Products to Life.

At Avnet Electronics Marketing, support across the board is much more than a tagline for us. From design to delivery – we are deeply committed to driving maximum efficiency throughout the product lifecycle.

The Challenge

When the world’s largest FPGA maker Xilinx was mounting the launch of its latest Spartan™-3 Generation family of low-cost FPGAs, the company wanted to ensure designers would have everything they need to compete in the highly competitive high-volume market. Xilinx turned to longtime partner Avnet to lend its technical expertise in creating a complete solution.

The Solution

Avnet spent countless engineering hours alongside Xilinx – developing a comprehensive solution of starter kits, development boards, and Speedway™ educational workshops. With Xilinx® Spartan-3 Generation FPGAs and Avnet’s support across the board, designers have access to all they need to save up to 50 percent in total system cost over competing FPGAs.

Visit www.em.avnet.com/xilinxspartan3kits to learn more about Spartan-3 solutions and to purchase a Spartan-3A evaluation kit for only $39!

Support Across The Board™

Bryan Fletcher
Avnet
Global Technical Marketing Manager

David Loftus
Xilinx
Vice President / General Manager
General Products Division (GPD)

LETTER FROM THE PUBLISHER

Xilinx, Inc.
2100 Logic Drive
San Jose, CA 95124-3400
Phone: 408-559-7778
FAX: 408-879-4780
www.xilinx.com/xcell/

© 2008 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.

The articles, information, and other materials included in this issue are provided solely for the convenience of our readers. Xilinx makes no warranties, express, implied, statutory, or otherwise, and accepts no liability with respect to any such articles, information, or other materials or their use, and any use thereof is solely at the risk of the user. Any person or entity using such information in any way releases and waives any claim it might have against Xilinx for any loss, damage, or expense caused thereby.

Forrest Couch
Publisher

PUBLISHER Forrest Couch [email protected]

EDITOR Charmaine Cooper Hussain

ART DIRECTOR Scott Blair

DESIGN/PRODUCTION Teie, Gelwicks & Associates, 1-800-493-5551

ADVERTISING SALES Dan [email protected]

TECHNICAL COORDINATORS Larry Caputo, Jay Gould

INTERNATIONAL Piera Or, Asia [email protected]

Christelle Moraga, Europe/Middle East/[email protected]

Yumi Homura, [email protected]

SUBSCRIPTIONS All Inquiries: www.xcellpublications.com

REPRINT ORDERS 1-800-493-5551

Xcell journal

www.xilinx.com/xcell/

Integrated System Platforms: The Next Wave in FPGA-based Innovation for Embedded Processing Applications

FPGAs have come a long way since the days of serving as prototypes to test design concepts before committing them to ‘permanent’ silicon. FPGAs have now become the foundation for many of the most innovative products that improve our lives every day. Driven by diverse and demanding market requirements, FPGAs are valued above all else for their flexibility. Their inherent programmability is the insurance policy product developers depend on to quickly and efficiently meet market windows and keep products current in fast-moving, evolving markets.

As many of the articles published in Xcell Journal illustrate, FPGAs are no longer single-function devices. These programmable devices have emerged as the preferred platform for the highly integrated world of systems-on-chip (SoC) and embedded system design. Therefore, it is no coincidence that many of you will see this issue of our magazine at the Embedded Systems Conference in San Jose, CA, one of the largest gatherings of hardware designers and software developers.

It’s also appropriate that this theme is highlighted as Xilinx introduces its ultimate system integration platform – the Virtex™-5 FXT FPGA. Comprising the fourth platform in the 65-nanometer Virtex-5 family, Virtex-5 FXT devices combine flexibility with very high-performance embedded processing, digital signal processing, and connectivity capabilities in a single chip to reduce total system cost, power, and board real estate.

In this Issue

The significance of the Virtex-5 FXT platform for embedded developers is underscored in the cover article, which describes in detail the industry’s first FPGAs with embedded PowerPC® 440 processor blocks, high-speed RocketIO™ serial transceivers, and dedicated XtremeDSP™ processing capabilities. These devices deliver an optimal system integration platform that meets market and bandwidth requirements of transporting voice, video, and data for a wide range of applications in wired and wireless communications, audio/video broadcast equipment, military, aerospace and industrial systems, along with many others. This issue provides a “behind-the-scenes” look into a variety of FPGA-based embedded system applications, ranging from outer space and aerospace to more earthly scientific and automotive innovations.

In addition, you’ll learn how every aspect of Xilinx embedded processing solutions has been substantially upgraded and usability drastically simplified through intuitive hardware and software tools. Advancements include the v7 release of the MicroBlaze™ 32-bit soft processor with configurable memory management unit, the new high-performance processor interconnect architecture, and the ISE™ Design Suite 10.1 development environment with unified access to all logic, embedded, and DSP design tools. Of course, our expansive ecosystem of third-party embedded technology and service providers, several of whom are featured in this issue, also plays a key role in helping developers maximize the benefits of Xilinx embedded solutions.

At the end of the day, the goal of these efforts is to enable rapid design of high-performance embedded systems with highly integrated, scalable hardware platforms, customizable embedded processors, and pre-verified IP cores. Designers can then focus their efforts on creating differentiated value, thereby getting products to market in time to extract maximum return and ultimately realizing the full potential of their own inspirations.

ON THE COVER

Cover (page 8)
Embedded Processing Innovations with Virtex-5 FXT FPGAs
Get ahead of the embedded system design curve by maximizing performance and throughput with the built-in PowerPC 440 processor block.

EMBEDDED APPLICATIONS (page 14)
Next-Generation Wireless Standards Using Virtex-5 FXT Devices
LTE baseband reference design implements hardware, software and high-speed off-chip comms for wireless baseband processing on single device.

EMBEDDED APPLICATIONS (page 22)
FPGA Coprocessors for Spacecraft Image Processing
C-to-FPGA design techniques speed development of performance-accelerated space-based imaging applications.

SYSTEM DEVELOPMENT (page 48)
Designing Multiprocessor SOCs
Using Xilinx EDK tools and IP, you can scale your applications by designing SOCs with multiple processors.

PLATFORMS, PROCESSORS, & TOOLS (page 61)
Building Linux Platforms on Xilinx Processors
Find out how a LinuxLink subscription accelerates your Linux development and mitigates project risks.

SECOND QUARTER 2008, ISSUE 64
Xcell journal

VIEWPOINTS

Letter from the Publisher: The Next Wave in Embedded Processing Innovation..........................3

Xilinx Welcomes New President & CEO ...............................................................................7

Power.org: Presenting Power Architecture Technology..........................................................20

FEATURES

Cover

Embedded Processing Innovations with Virtex-5 FXT Devices..................................................8

Embedded Applications

Implementing Next-Generation Wireless Standards Using Virtex-5 FXT Device Parts .................14

Developing FPGA Coprocessors for Performance-Accelerated Spacecraft Image Processing ............22

Using the MicroBlaze Processor to Develop Speed Sensors for F1 Racing Cars ........................28

FPGA-Based Controller Enables Precise Chemical Analysis .....................................................32

Reconfiguring the Battlespace..........................................................................................36

System Development

Hardware/Software Optimization of Embedded Systems.....................................................41

Custom Processors in FPGAs ............................................................................................45

Designing Multiprocessor SOCs ........................................................................................48

Accelerating Video Development on FPGAs Using the Xilinx XtremeDSP Video Starter Kit.....................................................................52

Platforms, Processors, & Tools

Microprocessor Report : MicroBlaze v7 Gets an MMU..........................................................55

Building Linux Platforms on Xilinx Processors with LinuxLink ................................................61

Quickly Build Distributed Systems.....................................................................................65

Debugging Systems with FPGA Embedded Processors..........................................................68

A Complete Debugging Solution for Xilinx Embedded Processors ...........................................72

Reduce FPGA Design Time with PICO Express FPGA.............................................................75

Design Embedded Systems with Xilinx and Synplicity..........................................................78

RESOURCES

Embedded Processing Application Notes and Reference Systems...........................................81

Embedded Processing QuickStart! ....................................................................................82

Embedded
Unlock your future

Enter the New Era of Configurable Embedded Processing

Adapt to changing algorithms, protocols and interfaces, by creating your next embedded design on the world’s most flexible system platform. With the latest processing breakthroughs at your fingertips, you can readily meet the demands of applications in automotive, industrial, medical, communications, or defense markets.

Architect your embedded vision
• Choose MicroBlaze™, the only 32-bit soft processor with a configurable MMU, or the industry-standard 32-bit PowerPC® architecture
• Select the exact mix of peripherals that meet your I/O needs, and stitch them together with the new optimized CoreConnect™ PLB bus

Build, program, debug . . . your way
• Port the OS of your choice including Linux 2.6 for PowerPC or MicroBlaze
• Reduce hardware/software debug time using Eclipse-based IDEs together with integrated ChipScope™ analyzer

Eliminate risk & reduce cost
• No worry of processor obsolescence with Xilinx Embedded Processing technology and a range of programmable devices
• Reconfigure your design even after deployment, reducing support cost and increasing product life

Order your complete development kit today, and unlock the future of embedded design.

©2008 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.

www.xilinx.com/processor

At the Heart of Innovation

Get the Complete Embedded Solution

Linux 2.6


View from the top

Xilinx Welcomes New President & CEO
Wim Roelandts Passes the Torch to Industry Veteran Moshe Gavrielov

On January 7, 2008, Xilinx announced the appointment of a new president and CEO, Moshe Gavrielov. He assumes the role from Wim Roelandts, one of the most revered CEOs in the industry.

Roelandts remains chairman of the board, while Gavrielov becomes only the third Xilinx® CEO in the company’s 24-year history. In this issue’s View from the Top, Moshe provides a glimpse into his vision for Xilinx and how he plans to get us there.

Let me start by saying how honored and excited I am to assume responsibility for leading one of the most highly respected semiconductor companies in the world. This is truly my dream job and I feel I’m joining the company at a very exciting time. We have a great opportunity to bring the benefits of Xilinx technology to a broader range of industries and applications in the coming years.

FPGAs are becoming more and more relevant to a wider range of designers. It’s a well-known fact that the best way for designers to prepare for change is to arm themselves with options. The FPGA has always provided these options through the programmability of its hardware. We are the ultimate providers of faster time to market. But it’s not just about silicon and flexibility – it’s about the IP as well. In that way, Xilinx today is in a similar transition to one I have faced before: moving from a supplier of blank gates to being a supplier and supporter of the IP that goes into the gates. But supporting a body of IP in the field is a non-trivial undertaking.

I see a three-fold opportunity for Xilinx to better serve our customers while achieving higher revenues and increased market share. First, we must maintain our pace in expanding the underlying capabilities of our silicon. Second, we must continue to build our portfolio of IP across a growing breadth of applications. And third, we must continue to invest in our development tools so that we are able to serve the diverse needs of a growing, increasingly specialized community of traditional and new users.

Xilinx is already on the path to becoming a complete solutions supplier. But we have a ways to go before we get there. We need to learn from our customers across the globe, developing an intimate understanding of their pain points as if they were our own. Then we need to provide solutions to make their pain disappear.

On DSP and Embedded

Xilinx already had its eye on the DSP and embedded markets when I arrived. Three years ago, the company announced the formation of a new division to broaden the reach of Xilinx FPGAs into multi-billion dollar vertical markets previously the domain of ASICs and ASSPs, including audio, video and broadcast, industrial automation, aerospace and defense, medical, automotive, and consumer electronics.

FPGAs offer a compelling value proposition and significant technology advantages for both high-performance DSP and embedded applications. Already, DSP and embedded processors have opened an entirely new set of opportunities for Xilinx. By integrating key functionalities into a single FPGA, designers can reduce total system cost, power, and board space by reducing component count. Simple, right? Not so simple actually, which leads me back to my point on becoming a total solutions provider. Every designer – particularly those coming from the DSP or embedded space – will need more than just FPGAs to make the proverbial leap from an ASIC or ASSP.

Xilinx must provide all of the necessary tools, including IP, software tools, and design methodologies that allow designers to leverage the high-performance, flexible capabilities of the FPGA. Oh yeah – and we need to make that leap as compelling and simple as possible. I, for one, am looking forward to doing just that.

For the latest in Xilinx solutions designed to meet the needs of the DSP and embedded designer, visit www.xilinx.com.

Second Quarter 2008 Xcell Journal 7

COVER STORY

Embedded Processing Innovations with Virtex-5 FXT Devices

Maximizing performance and throughput utilizing the built-in PowerPC 440 processor block.

by Craig Abramson
Product Marketing Manager
Xilinx, Inc.
[email protected]

Dan Isaacs
Director of Embedded Processor Marketing
Xilinx, Inc.
[email protected]

Ahmad Ansari
Principal Engineer
Xilinx, Inc.
[email protected]

With the advent of the Xilinx® Virtex™-5 FXT FPGA, you have an opportunity to get ahead of the embedded system design curve. The need to quickly develop and validate embedded systems has never been more apparent than in the realm of embedded system design.

Combining software and hardware to demonstrate this at a system level (as quickly as time permits) has become commonplace in the industry. By providing a more tightly coupled, flexible, scalable solution, you have a means to address many hardware and software SOC design challenges.

FPGAs provide a significantly faster path for designers to rapidly develop, prototype, and test their embedded designs. The Virtex-5 FXT device platform, the third-generation FPGA to feature a PowerPC processor, has added an embedded block that will help you meet more demanding design requirements while allowing you to finish your designs quickly and easily.

In this article, we’ll provide a detailed description of the embedded processing innovations in the PowerPC 440 processor block and system interconnect. A key area of focus in the Virtex-5 FXT FPGA processor block is simplification through integration.

A corollary to this is ease of development and test. Quickly bringing up a system to allow software developers to get a head start on actual hardware is a major emphasis for the Virtex-5 FXT device’s PowerPC 440 processor.

Simplification Through Integration

Integration is key. We have reduced the amount of FPGA logic needed to build a high-performance processing system while still allowing a wide variety of topologies. You still have the flexibility and advantages of an FPGA-based implementation, but you now also have the added benefit of a hardened, integrated interconnect architecture that (among other things) maximizes access to external memory.

As you will see, the result is an embedded block that allows you to develop a wider range of high-performance processing architectures in a shorter period of time.

PowerPC processors generically have three interfaces: instruction read, data read, and data write. In previous Virtex device architectures, which embedded the PowerPC 405, these processor buses would connect to FPGA fabric. The timing closure requirements of this circuitry would vary based on how many and what types of loads the design presented to the buses.

In the Virtex-5 FXT FPGA (where the processor is now the PowerPC 440), these buses are hardened and hooked directly to a new structure, an integrated 5 x 2 crossbar switch – generically referred to as the crossbar. This hardened interconnect provides significantly higher performance (with virtually no consumption of FPGA logic resources and fixed timing) when combined with the rest of the architectural enhancements in the Virtex-5 FXT device’s embedded processor block. This results in an overall system cost reduction and invariably a more tightly integrated processor system.

The processor buses only take up three of the five “crossbar master” ports on the 5 x 2 crossbar (see Figure 1). The crossbar includes two additional master ports, because in many real-world applications it’s not just the processor that needs access to memory or peripherals. Each of these “crossbar master” ports comprises a processor local bus (PLB) slave interface, as well as two channels of scatter-gather direct memory access (DMA).

The “slave” side of the crossbar comprises two ports. One port is a dedicated memory controller interface that provides a high-throughput generic interface to soft memory controllers. The other is a bus for attaching I/O devices and peripherals.

A Better Processor

Providing all of this extra functionality in the embedded processor block would be of little consequence if there were not a processor with the horsepower to take advantage of it. The Virtex-5 FXT FPGA represents the first time anyone has embedded a PowerPC 440-class processor in an FPGA.

Figure 1 – The crossbar
(Diagram labels. Crossbar masters: Processor Instruction Read, Processor Data Read, Processor Data Write, and two PLB Slave/DMA ports. Crossbar slaves: Memory Controller Interface and PLB Master.)

The PowerPC 440 offers a significant performance improvement over the PowerPC 405 (which was embedded in previous Virtex families) in a number of areas.

First, the PowerPC 440, when in the fastest speed-grade FPGA, can be clocked at 550 MHz. The PowerPC 405 topped out at 450 MHz. That is a clock-rate increase of roughly 22%. Add to that the fact that the I and D cache sizes are doubled, the instruction pipeline is seven stages instead of five, and the execution unit can now execute two instructions in parallel and out of order.

The result? You’ve got a processor with performance sufficient to handle a great many of today’s embedded processing challenges.

There are a number of other advantages to moving from the PowerPC 405 to the PowerPC 440, as shown in Table 1.

The PowerPC 440 embedded block is shown in Figure 2.
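As a quick arithmetic check on the comparison above, the short Python sketch below computes the percentage gains from figures quoted in this article: the 550 MHz and 450 MHz clock rates, and the 1000+ and 700+ DMIPS estimates from Table 1. The DMIPS values are lower bounds, so the second ratio is only indicative, and the helper name percent_gain is ours.

```python
def percent_gain(new: float, old: float) -> float:
    """Return the percentage increase of `new` over `old`."""
    return (new / old - 1.0) * 100.0


# Maximum clock rates quoted in the article, in MHz.
clock_gain = percent_gain(550.0, 450.0)

# Lower-bound DMIPS estimates from Table 1 ("1000+" versus "700+").
dmips_gain = percent_gain(1000.0, 700.0)

print(f"clock: +{clock_gain:.1f}%  DMIPS estimate: +{dmips_gain:.1f}%")
```

The gap between the two ratios is consistent with the point made in the text: clock rate is only part of the improvement, with the larger caches, deeper pipeline, and dual issue accounting for the rest.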

High-Throughput Switch Matrix

The 5 x 2 crossbar is more than just a big switch. It provides non-blocking pipelined access from the five crossbar masters to the two crossbar slaves (see Figure 1). It allows concurrent transfers between different agents on the crossbar.

As shown, we’ll call the buses going into the crossbar “crossbar masters” and the buses coming out “crossbar slaves.” These interfaces are highly pipelined, thus allowing a large number of transactions to be in progress at the same time.

In fact, up to four concurrent transactions can exist: two for each crossbar slave (such as the memory controller or PLB master). Additionally, each crossbar master (that is, the three processor PLBs and the two PLB slave interfaces) can pipeline four read and four write transactions to the same slave.

Another key feature of the crossbar is its highly programmable memory mapping. You can think of the entire system as having an available memory space of 4 GB. Both the memory controller interface and the PLB master can have different memory windows mapped into the memory space of any of the crossbar masters. These memory spaces can be programmed through the FPGA bitstream, by the processor at run time, or even by external logic on the FPGA using the crossbar’s sideband bus, called the device control register (DCR) bus.
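To make the idea of memory windows concrete, here is a minimal Python sketch of how an address in the 4 GB space might be routed to one of the two crossbar slaves. It is an illustrative model only: the Window class, the route function, and the base addresses and sizes are all hypothetical and are not taken from Xilinx documentation or from the actual DCR register layout.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Window:
    """One address window routed to a crossbar slave (hypothetical model)."""
    base: int
    size: int
    target: str  # "MCI" (memory controller interface) or "MPLB" (PLB master)

    def contains(self, addr: int) -> bool:
        return self.base <= addr < self.base + self.size


# Illustrative map only; these values are assumptions, not a real design.
ADDRESS_MAP = [
    Window(base=0x0000_0000, size=0x2000_0000, target="MCI"),   # external memory
    Window(base=0x8000_0000, size=0x1000_0000, target="MPLB"),  # peripherals
]


def route(addr: int, windows: Sequence[Window] = ADDRESS_MAP) -> Optional[str]:
    """Return the crossbar slave whose window claims `addr`, or None if unmapped."""
    if not 0 <= addr < 2**32:
        raise ValueError("address outside the 4 GB space")
    for window in windows:
        if window.contains(addr):
            return window.target
    return None
```

Because the article notes that each crossbar master can see different windows, a fuller model would keep one such list per master, for example in a dictionary keyed by master name.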

Integrated PLB Interfaces

As we mentioned earlier, many of the buses connected to the crossbar are processor local buses, also called PLBs.

The PLB is one of the standard CoreConnect buses as defined by IBM. An earlier version of the PLB (version 3.4) was used as one of the standard buses on PowerPC 405 designs in Virtex-II Pro and Virtex-4 FX FPGAs; the PLB is also used in the new PowerPC 440 embedded processor block.

In the PowerPC 440 embedded processor block, the PLBs connect the processor’s internal caches to the input side of the crossbar. The buses are:

• ICURD: instruction cache unit read

• DCURD: data cache unit read

• DCUWR: data cache unit write

Figure 2 – The PowerPC 440 embedded processor block
(Diagram labels: PowerPC 440 Processor with I, R, and W buses; DCR; APU; PLB Slv0; PLB Slv1; four DMA channels; PLB Master; Memory I/F.)

Table 1 – Virtex PowerPC integrated processor feature comparison

Architecture
  PPC405 (Virtex-4 FX FPGA): 32-bit instruction, 32-bit address, 64-bit data
  PPC440 (Virtex-5 FXT FPGA): 32-bit instruction, 36-bit address, 128-bit data, Book E compliant
  Benefit: Access more physical memory, higher speed data movement

Pipeline
  PPC405 (Virtex-4 FX FPGA): Single instruction/cycle, five-stage pipeline, in-order issue
  PPC440 (Virtex-5 FXT FPGA): Two instructions/cycle, seven-stage pipeline, out-of-order issue
  Benefit: More efficient instruction execution

Caches – I/D
  PPC405 (Virtex-4 FX FPGA): 16K/16K, two-way set associative, no locking
  PPC440 (Virtex-5 FXT FPGA): 32K/32K, 64-way set associative, locking
  Benefit: Less memory access latency

MMU
  PPC405 (Virtex-4 FX FPGA): Page size: 1 KB to 16 MB
  PPC440 (Virtex-5 FXT FPGA): Page size: 1 KB to 256 MB
  Benefit: Less page swapping

DMIPS Estimate
  PPC405 (Virtex-4 FX FPGA): 700+ DMIPS
  PPC440 (Virtex-5 FXT FPGA): 1000+ DMIPS
  Benefit: Better benchmarks equal higher performance

The PLB used in the Virtex-5 FXT device is version 4.6 (PLB46). The PLB46 bus architecture brings with it a number of new capabilities that give it a nice performance boost over its predecessor. The most obvious is the fact that while PLB34 was 64 bits, PLB46 is 128 bits. But not to worry – if the IP connected to the bus is less than that, the bus will perform dynamic bus sizing as required to accommodate 32- and 64-bit transactions.

We should also point out that the PLB46 version is a Xilinx implementation of the IBM-defined PLB46, optimized to take advantage of FPGA resources.
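One way to picture the effect of bus sizing is to count the data beats a transfer needs when the attached IP is narrower than the 128-bit bus. The Python sketch below is a simplified model under stated assumptions: it ignores alignment, arbitration, and protocol overhead, and the function name data_beats is ours rather than anything defined by the PLB46 specification.

```python
def data_beats(num_bytes: int, device_width_bits: int) -> int:
    """Count the data beats needed to move `num_bytes` to a device of the given width.

    Simplified model: every beat carries a full device-width word and the
    transfer starts on an aligned address.
    """
    if device_width_bits not in (32, 64, 128):
        raise ValueError("device width must be 32, 64, or 128 bits")
    if num_bytes < 0:
        raise ValueError("num_bytes must be non-negative")
    bytes_per_beat = device_width_bits // 8
    return -(-num_bytes // bytes_per_beat)  # ceiling division


# A 16-byte (128-bit) word sent to devices of each supported width.
for width in (128, 64, 32):
    print(width, "bit device:", data_beats(16, width), "beat(s)")
```

In this model a full 128-bit word takes one beat to a 128-bit slave, two beats to a 64-bit slave, and four beats to a 32-bit slave, which is why narrow peripherals cost bandwidth even though they remain compatible.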

PLB46 – and indeed all versions of PLB – have the concept of master and slave. This should not be confused with crossbar master and crossbar slave. (Again, refer to Figure 1.) As we stated earlier, there are two PLB slave port interfaces on the crossbar; they are crossbar masters. These slave ports are connected to the FPGA fabric.

In a processor system there is often the need to allow something besides the processor to access external memory or on-chip peripherals. The PLB slave interfaces allow just that. PLB masters, built from FPGA logic and connected to the PLB slave ports (see Figure 3), can access either the memory controller interface (MCI) or the PLB master (MPLB) through the crossbar.

Similarly, the function of the PLB master (the one that is the crossbar slave) is to provide a PLB to hook to I/O devices and soft peripherals. And because the PLB master is a crossbar slave, anything hooked to a crossbar master port can access it.

Note that there can be no more than four PLB masters connected to each PLB slave bus. Few systems are likely to need more than four masters, but if you did need more, you could always use the PLB/PLB bridge IP provided with the Embedded Development Kit (EDK) (see www.xilinx.com/support/documentation/ipembedprocess_coreconnect_plbbusstruct.htm).

Figure 3 is a simplified system diagram showing how PLB peripherals can be hooked to crossbar master and crossbar slave ports. Note that if you have multiple masters on any PLB, arbitration is handled by the IP for the bus. You do not need a separate arbiter.

Optimized DMA Engines

There are four additional crossbar masters; they are the four DMA channels. Each DMA channel has separate 32-bit transmit and 32-bit receive interfaces. They share crossbar arbitration with PLB slave interfaces, as shown in Figure 4.

All DMA ports can be operating at the same time. Each one has a dedicated FIFO, so as one DMA is accumulating data, the other DMA can be pumping data through the crossbar. Each DMA channel operates asynchronously to the processor clock.

The interface into the DMA channels is through an interface called LocalLink. Xilinx uses the LocalLink interface in a number of IP blocks. LocalLink is a point-to-point interface that sends packets to, or receives packets from, some external device.

Figure 3 – PLB masters and slaves
(Diagram labels: PPC440; ICURD, DCURD, DCUWR; two PLB Slave ports; PLB master devices; PLB slave devices; MCI; soft memory controller; external DDR2 memory; MPLB.)

Figure 4 – PLB slave buses and DMA channels share crossbar arbitration.
(Diagram labels: PLB Slv0 and PLB Slv1, each with PLB Slave (128), two DMA channels with LocalLink RX (32) and LocalLink TX (32), and programmable arbitration; MCI; MPLB.)

The most notable type of processor IP that uses the LocalLink interface is the hard embedded tri-mode Ethernet media access controller (TEMAC) block. The TEMAC has a wrapper that allows it to communicate directly with the PowerPC 440 DMA.

Although all data paths through the crossbar are 128 bits, the LocalLink interfaces into and out of the DMA channels are all 32 bits. As such, there is built-in logic between the DMA controller and the crossbar that realigns data.
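The width conversion described above can be pictured as packing four 32-bit LocalLink words into each 128-bit crossbar word. The Python sketch below illustrates only that idea. The word ordering (first word in the most significant position) and the zero padding of a short final group are our assumptions for illustration; the real realignment logic is hardened and its exact behavior is described in the device documentation, not here.

```python
from typing import Iterable, List


def pack_words(words32: Iterable[int], fill: int = 0) -> List[int]:
    """Group 32-bit words into 128-bit values, first word most significant.

    A final group with fewer than four words is padded with `fill`.
    """
    words = list(words32)
    packed = []
    for i in range(0, len(words), 4):
        group = words[i:i + 4]
        group += [fill] * (4 - len(group))  # pad a short final group
        value = 0
        for word in group:
            if not 0 <= word < 2**32:
                raise ValueError("each word must fit in 32 bits")
            value = (value << 32) | word
        packed.append(value)
    return packed


# Eight 32-bit words become two 128-bit crossbar words.
wide = pack_words([0x11111111, 0x22222222, 0x33333333, 0x44444444,
                   0x55555555, 0x66666666, 0x77777777, 0x88888888])
print([f"{value:032x}" for value in wide])
```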

To maximize throughput and performance, the PowerPC 440 embedded block employs scatter/gather DMA. To make using this capability as easy as possible, Xilinx provides wrappers for the various pieces of IP and embedded blocks it offers.

The first one targeted specifically toward the PowerPC 440 is the soft wrapper for the embedded TEMAC blocks.

This wrapper, combined with the functionality of the DMA engine in the PowerPC 440 embedded block, allows you to easily build a processing system with a high-performance TEMAC connected directly to the PowerPC 440 DMA channels. Figure 5 is a simplified system showing how both DMA and PLB peripherals can be hooked to crossbar master and crossbar slave ports.

The DMA channels are controlled by descriptors, small blocks of memory that are set up by the PowerPC 440 processor before commencing DMA operations. The descriptors control how much data is transferred and where data is located in system memory.

Descriptors can be chained together if need be, effectively creating a sequence of commands to control a DMA channel. The DMA controller is covered in its entirety in the reference guide, entitled “Embedded Processor Block in Virtex-5 FPGAs” (http://www.xilinx.com/support/documentation/user_guides/ug200pdf).
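The descriptor chaining described above can be sketched as a linked list that a DMA engine walks from head to tail. The Python model below is a conceptual illustration of scatter/gather only. The Descriptor fields (buffer_addr, length, next_addr) and the gather function are hypothetical names of ours and do not reproduce the actual descriptor layout, which is defined in the reference guide.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Descriptor:
    """Hypothetical descriptor: where the data is, how much, and what comes next."""
    buffer_addr: int
    length: int
    next_addr: Optional[int]  # address of the next descriptor; None ends the chain


def gather(memory: bytes, descriptors: Dict[int, Descriptor], head: int) -> bytes:
    """Walk a descriptor chain and concatenate the buffers it points to."""
    out = bytearray()
    visited = set()
    addr: Optional[int] = head
    while addr is not None:
        if addr in visited:
            raise ValueError("descriptor chain loops back on itself")
        visited.add(addr)
        desc = descriptors[addr]
        out += memory[desc.buffer_addr:desc.buffer_addr + desc.length]
        addr = desc.next_addr
    return bytes(out)


# Two scattered buffers in "system memory" gathered into one outgoing packet.
system_memory = b"HEADERxxxxPAYLOAD"
chain = {
    0x100: Descriptor(buffer_addr=0, length=6, next_addr=0x200),
    0x200: Descriptor(buffer_addr=10, length=7, next_addr=None),
}
packet = gather(system_memory, chain, head=0x100)
print(packet)
```

The point of the structure is that the processor prepares the chain once and the DMA channel then moves each buffer without further processor involvement.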

High-Performance Dedicated Memory Interface

Rounding out the new processor block is the dedicated memory controller interface. The purpose of this interface is to provide a dedicated link out to external memory, but at the same time not be tied to any specific memory technology.

At this time, the memory controller interface supports a stand-alone DDR2 controller and an MPMC4 controller, both available through Xilinx Platform Studio, EDK 10.1. This interface provides the flexibility to connect to virtually any memory technology now or in the future.

The memory controller interface is streamlined, comprising address, data, and control signals. It can be programmed to support 128-, 64-, 32-, or even 16-bit memory. It does width and burst realignment, so while the DMA may be bursting one size packet, the memory controller can buffer and realign the packet data to maximize the bandwidth to the memory. Burst size is programmable and can be 1, 2, 4, or 8, and the memory controller interface will automatically adjust the address to accommodate the various burst widths.
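The effect of the width and burst settings can be sketched with simple arithmetic. The helpers below are illustrative only; they assume that "burst size" means the number of data beats per burst and that a transfer is padded up to a whole number of beats and bursts.

```python
SUPPORTED_WIDTHS = (16, 32, 64, 128)   # memory widths named in the article
SUPPORTED_BURSTS = (1, 2, 4, 8)        # burst sizes named in the article


def memory_beats(transfer_bits, memory_width_bits):
    """Number of memory data beats needed to move transfer_bits."""
    if memory_width_bits not in SUPPORTED_WIDTHS:
        raise ValueError("unsupported memory width")
    return -(-transfer_bits // memory_width_bits)   # ceiling division


def bursts_needed(transfer_bits, memory_width_bits, burst_length):
    """Number of bursts needed, assuming burst_length beats per burst."""
    if burst_length not in SUPPORTED_BURSTS:
        raise ValueError("unsupported burst length")
    beats = memory_beats(transfer_bits, memory_width_bits)
    return -(-beats // burst_length)
```

For instance, one 128-bit crossbar word needs four beats on a 32-bit memory, which fits in a single burst of 4.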

The majority of soft memory controller handshaking signals are generated by the interface on behalf of the memory controller. They are provided ahead of time such that the soft memory controller can generate throttling signals back to the memory interface. The memory controller interface – on behalf of the soft memory controller – can also be programmed to detect bank and row misses ahead of time and will inform the soft memory controller to anticipate a bank or row miss. All of these features together provide a solution whose primary goal is to maximize memory throughput.

Figure 5 – Sample system with both PLB and DMA peripherals

Tuning the System

In some situations, a PLB or DMA interface just may not be the right solution. For instance, you might find that you have a software algorithm that takes too many cycles to execute and is affecting your system bandwidth. That algorithm is a great candidate for implementation in hardware, and the interface to which you may want that hardware connected would be the auxiliary processing unit, or APU interface.

The PowerPC 440 has a second-generation APU interface that is tightly coupled to the execution units of the processor. The interface is controlled by 16 user-defined instructions (UDIs). The data path of the APU interface is 128 bits.

Perhaps the most common use of the APU interface is for connecting to a floating-point unit (FPU). The FPU is IEEE 754-compatible and supports both single- and double-precision operations for the PowerPC 440.

The FPU is implemented in the FPGA soft logic fabric and utilizes the DSP48E blocks. The soft logic implementation operates at up to half the frequency of the hard embedded processor.

Other uses of the APU interface include hardware algorithm acceleration, as well as an alternative high-bandwidth link to block RAM.

Configuring the Embedded Block

Because the PowerPC 440 block is integrated in the FPGA, it can be configured in multiple ways. Virtually every interface is programmable.

For example, when you build your processing system in the Xilinx Platform Studio development environment and a bitstream is created, all of the specifications of the processing system are in the bitstream. Thus, when the FPGA starts up, your processor is up and running.

Now, let’s say the processing system is up and running and you want to modify the operation of one of the DMA channels. You would do that through the DCR interface. There are DCR registers to control every aspect of DMA operation.

In fact, there is DCR access to virtually every other subsystem of the embedded block: the PLBs and crossbar, memory controller interface, and the APU controller. Refer to Figure 2 for more details.

Putting It All Together

This innovation would be for naught if Xilinx did not provide a comprehensive infrastructure to take advantage of all of the architectural enhancements. We should point out that the Virtex-5 FXT FPGA with the PowerPC 440 block represents our eighth year in embedded processing and our third-generation FPGA with a hardened processor.

Throughout that time we’ve been constantly updating EDK, our award-winning Embedded Development Kit. EDK includes Platform Studio, with its comprehensive library of IP for hardware design, and Platform Studio SDK, a software development environment familiar to many embedded software engineers.

With the introduction of the Virtex-5 FXT family of devices, we continue to strengthen our third-party alliances with support from industry-leading operating system providers, including Wind River Systems with VxWorks and Green Hills Integrity.

Linux support is provided through LynuxWorks, MontaVista, and Wind River Systems. In addition, Xilinx recognizes the importance of open-source Linux, and we’re moving forward on that front.

Xilinx and its partner companies are also developing a wide variety of boards. Xilinx has multiple boards for the Virtex-5 FXT device: the ML507 with the XC5VFX70T and the ML510 with the XC5VFX130T, as shown in Figure 6. The ML507 evaluation platform enables your team to quickly begin developing hardware, software, or both. When multiple processors or a motherboard-type platform are required, the ML510 with the dual-processor XC5VFX130T is ideal.

Conclusion

A high-performance processing solution with optimized data throughput is high on the wish list of embedded designers everywhere. This is true whether you are running critical algorithms at the heart of the latest wireless base station, switching high-bandwidth data through a video switch, performing advanced signal processing for guidance systems using coprocessor acceleration, or handling complex control and system management tasks.

The Virtex-5 FXT embedded processor block, with a multi-ported non-blocking integrated processor interconnect and high-performance integrated DMA, offers a solution that allows you to focus on the key elements of your embedded design.

With a virtually unlimited number of ways to harness these embedded capabilities, the Virtex-5 FXT FPGA embedded processing solution provides a highly integrated platform for high-performance, high-throughput SOC designs.

Figure 6 – Xilinx ML507 evaluation and ML510 development boards. Virtex-5 FXT development platforms jump-start your design. Single PPC: ML507 Evaluation Board (XC5VFX70T-1FF1136C; PCIe x1 plug-in or stand-alone). Dual PPC: ML510 Development Board (XC5VFX130T-1FF1738C; ATX form factor).


by Rob Payne Staff Engineer Xilinx, [email protected]

The next generation of the 3GPP wireless standard is called long-term evolution (LTE). It provides a leap in performance and a complete move to packet-based processing. In the physical (PHY) level of the LTE specification, specific challenges exist when dealing with higher data throughput rates, as well as the move to OFDM (orthogonal frequency-division multiplexing) for transmission.

Xilinx has developed – or is in the process of developing – several new or revised DSP LogiCORE™ solutions to meet the demands of the new specification. With such blocks it is critical not only to verify them as stand-alone blocks, but also to validate them in real systems with real-world data. The Xilinx 3GPP downlink reference design provides this validation, as well as providing a reference to customers about how to use the blocks.


Implementing Next-Generation Wireless Standards Using Virtex-5 FXT Device Parts

The Xilinx LTE baseband reference design shows how the Virtex-5 FXT device enables hardware, software, and high-speed off-chip comms for wireless baseband processing – implemented on a single device.

EMBEDDED APPLICATIONS

The higher data rate in LTE places increased processing demands on all parts of the system: increased DSP hardware processing in the baseband, increased software processing to implement the higher layers of the UMTS protocol stack, and increased I/O communication bandwidth to accept packets and pass data to remote radio-heads.

In this article, I’ll review some of the new features of the LTE specification and how Xilinx® Virtex™-5 FXT devices address the increased processing demands of LTE through their tight integration of microprocessor subsystem, DSP-enhanced FPGA fabric, and high-speed communication links.

The 3GPP LTE Physical Layer

One of the key changes in the Layer 1 (PHY layer) of the 3GPP LTE is the change from CDMA (code-division multiple access) to OFDM (orthogonal frequency-division multiplexing). One of the main benefits of OFDM is that it reduces the problems associated with multiple paths in the radio channel. In CDMA, a significant amount of processing must be devoted to characterizing and tracking the radio channel to compensate for the effects of fading in the channel.

Figure 1 illustrates the structure of an example LTE subframe. The subframe comprises a number of OFDM symbols. Each OFDM symbol provides the data input for an inverse fast Fourier transform (IFFT). In LTE this may be as many as 2,048 input points for in-phase (I) and quadrature (Q) components.

A subframe can be represented as a resource grid, where each resource element in the grid comprises a single I/Q input point for the IFFT in an OFDM symbol. Resource grids can be layered to provide data to multiple antennas, supporting transmission schemes such as transmit diversity or MIMO (multiple input/multiple output) techniques.
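One way to picture the resource grid is as a small two-dimensional table. The sketch below uses the figures given with Figure 1 (14 OFDM symbols per subframe, resource blocks of 12 x 14 resource elements); the function names and the owner labels are hypothetical, and real subframes also reserve elements for control and synchronization, which this sketch ignores.

```python
SYMBOLS_PER_SUBFRAME = 14    # from Figure 1: 1 subframe = 14 OFDM symbols
SUBCARRIERS_PER_BLOCK = 12   # from Figure 1: resource block is 12 x 14


def make_grid(num_blocks):
    """Return grid[symbol][subcarrier]; None marks an unallocated element."""
    width = num_blocks * SUBCARRIERS_PER_BLOCK
    return [[None] * width for _ in range(SYMBOLS_PER_SUBFRAME)]


def allocate_block(grid, block_index, owner):
    """Give one whole resource block (12 subcarriers x 14 symbols) to owner."""
    start = block_index * SUBCARRIERS_PER_BLOCK
    end = start + SUBCARRIERS_PER_BLOCK
    if block_index < 0 or end > len(grid[0]):
        raise IndexError("resource block outside the grid")
    for row in grid:
        for sc in range(start, end):
            if row[sc] is not None:
                raise ValueError("resource element already allocated")
            row[sc] = owner


def count_elements(grid, owner):
    """Count resource elements currently assigned to owner."""
    return sum(row.count(owner) for row in grid)
```

Each column of such a grid (one symbol across all subcarriers) is what would be handed to the IFFT.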

The resource grid is allocated to different purposes. Resource elements are allocated to control channels, data channels, and synchronization signals. The diagram also shows the packetization of data on the channel – different areas of the resource grid are allocated to different users’ data as resource blocks. The task of scheduling data transmission and allocating resource blocks to users is performed by the higher software layers in the LTE stack.

Figure 1 – LTE uses OFDM. A subframe comprises a resource grid, with areas allocated to control, synchronization, and user data. Each column of the grid forms an OFDM symbol that is converted to the time domain by an IFFT. (Figure labels: 1 subframe = 1 ms = 14 OFDM symbols; sub-frequency = 15 kHz; a resource block is a 12 x 14 block of resource elements in a subframe.)

3GPP LTE Downlink Processing

Figure 2 shows the processing stages in the baseband section of the 3GPP LTE downlink. The processing for both transmit and receive can be split into two main sections: symbol-rate processing and sample-rate processing.

Symbol-rate processing is centered around forward error correction, used to add redundancy to the data stream in a bandwidth-efficient manner and to recover data at the receiver. Sample-rate processing is centered around the IFFT/FFT that performs the OFDM part of the baseband operation.

Transmit Symbol Rate Processing

The first stage in the Layer 1 (PHY) processing of the LTE downlink takes transport blocks from the media access controller (MAC) layer. Transport blocks have cyclic redundancy checks (CRCs) added, while larger transport blocks may be segmented to ensure that blocks do not exceed a maximum size supported by the forward-error encoder.

Each segment then has a CRC added before it is supplied to the forward-error encoder: for data channels this is a turbo encoder and for control channels this is a convolutional encoder. Following the encoding, rate matching is applied to the output to puncture the data so that it will fill the available resource blocks in the OFDM resource grid. Finally, the data stream is modulated with a specified modulation (QPSK, QAM16, or QAM64) to give a sample value to go in the OFDM resource grid.
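The CRC and segmentation steps can be sketched as a bit-count calculation. The constants below (a 24-bit CRC and a 6,144-bit maximum code block) are assumptions based on our understanding of 3GPP TS 36.212 and should be checked against the specification; the sketch also assumes that an unsegmented block carries only the transport-block CRC, and it ignores filler bits and the permitted interleaver sizes.

```python
MAX_CODE_BLOCK_BITS = 6144   # assumed turbo-encoder limit (check TS 36.212)
CRC_BITS = 24                # assumed CRC length for transport and code blocks


def segment_transport_block(payload_bits):
    """Return (number of code blocks, total bits sent to the encoder).

    A sketch only: counts bits, does not compute any CRC values.
    """
    if payload_bits <= 0:
        raise ValueError("payload must contain at least one bit")
    total = payload_bits + CRC_BITS              # transport block + TB CRC
    if total <= MAX_CODE_BLOCK_BITS:
        return 1, total                          # fits in one code block
    per_block = MAX_CODE_BLOCK_BITS - CRC_BITS   # room left after a CB CRC
    num_blocks = -(-total // per_block)          # ceiling division
    return num_blocks, total + num_blocks * CRC_BITS
```

A 20,000-bit transport block, under these assumptions, would be split into four code blocks before turbo encoding.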

Transmit Sample Rate Processing

Samples are first mapped to different antenna layers. This mapping allows support for schemes such as transmit diversity (multiple transmit paths to reduce noise at the receiver) or MIMO (MIMO techniques utilize multiple channels between transmitter and receiver to increase data rates).

The next step is to map samples to resource elements in the OFDM resource grid. This stage also adds synchronization signals to the resource grid, allowing the receiver to synchronize with the transmitter. The output of this stage is passed to an IFFT. The IFFT takes the samples from the frequency domain to I/Q signals in the time domain, which are ready to be sent to the radio-head for transmission. OFDM requires a cyclic prefix (CP) added to the data, which repeats the end of the time-domain signal at the start of the output. The CP size is determined by the size of the mobile cell and reflections. A sufficiently large CP is required to remove multi-path effects from the OFDM symbol.
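The IFFT and cyclic prefix steps can be illustrated in a few lines. This sketch uses a naive inverse DFT in pure Python so that it stays self-contained; it is not how the LogiCORE IFFT works, and the function names are made up for the example.

```python
import cmath


def idft(freq):
    """Naive O(N^2) inverse DFT: frequency-domain I/Q in, time-domain I/Q out."""
    n = len(freq)
    return [
        sum(freq[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def add_cyclic_prefix(samples, cp_len):
    """Copy the last cp_len time-domain samples to the front of the symbol."""
    if not 0 <= cp_len <= len(samples):
        raise ValueError("cyclic prefix longer than the symbol")
    return samples[len(samples) - cp_len:] + samples


def ofdm_symbol(freq, cp_len):
    """One column of the resource grid -> time-domain samples with CP."""
    return add_cyclic_prefix(idft(freq), cp_len)
```

Because the prefix is a copy of the symbol's tail, a delayed echo shorter than the prefix overlaps only repeated data rather than the previous symbol.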

Receive Sample Rate Processing

The receive processing generally follows the inverse of the transmit processing. The first step is to do an FFT on the incoming data to convert the time-domain signal back to the frequency domain. In the reference design, data is received across all OFDM sub-frequencies for all users, but a real mobile user would only decode data on the resource blocks that it had been allocated. Synchronization is also performed at this step to align the system to the start of each subframe in the OFDM data. The output of the FFT is processed through a layer demapper that inverts the layer mapping in the transmission.

Receive Symbol Rate Processing

The first step in the receive processing is to take modulation symbols and convert them to individual bits. For turbo-encoded data, this is a set of probabilities in the form of logarithmic-likelihood-ratios for the turbo decoder. For convolutionally encoded data, it is a distance metric that is then fed to a Viterbi decoder. The output of the error correction is checked for CRC validity and reassembled into the original transport blocks.
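To make the log-likelihood-ratio idea concrete, here is a minimal sketch for QPSK over an additive white Gaussian noise channel. It assumes a mapping in which the first bit of each pair sets the sign of I and the second sets the sign of Q (a 0 bit giving a positive component), and it takes the noise variance per real dimension as a known input; none of this is taken from the reference design itself.

```python
import math


def qpsk_modulate(bits):
    """Map bit pairs to unit-energy QPSK symbols (assumed mapping: 0 -> +, 1 -> -)."""
    if len(bits) % 2:
        raise ValueError("QPSK needs an even number of bits")
    a = 1 / math.sqrt(2)
    return [
        complex(a * (1 - 2 * bits[i]), a * (1 - 2 * bits[i + 1]))
        for i in range(0, len(bits), 2)
    ]


def qpsk_llrs(symbols, noise_var):
    """Per-bit LLR = ln(P(bit=0)/P(bit=1)); positive values favour a 0 bit."""
    if noise_var <= 0:
        raise ValueError("noise variance must be positive")
    scale = math.sqrt(2) / noise_var   # 2 * amplitude / variance
    llrs = []
    for s in symbols:
        llrs.extend((scale * s.real, scale * s.imag))
    return llrs


def hard_decision(llrs):
    """Collapse soft values to bits; a turbo decoder would use the LLRs directly."""
    return [0 if value >= 0 else 1 for value in llrs]
```

The magnitude of each LLR expresses confidence, which is what lets an iterative turbo decoder outperform a decoder that sees only hard bits.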

The LTE Baseband Reference Design

LTE has required a number of new or revised Xilinx LogiCORE solutions to meet the new requirements of the specification. For example, the turbo encoder uses LTE-specific interleavers; the IFFT requires the addition of a variable-length cyclic prefix; and the turbo decoder requires a far higher throughput than previous 3GPP standards.

Figure 2 – 3GPP LTE downlink processing: transmit and receive chains. Downlink TX: transport blocks, TB CRC insertion, TB segmentation, CB CRC insertion, turbo/convolutional encoder, rate matcher, and QAM modulation (symbol-rate processing), followed by layer mapper, resource mapping with sync insertion, and IFFT with CP insertion (sample-rate processing), producing antenna I/Q samples. Downlink RX: antenna I/Q samples, FFT with synchronization, resource mapping, layer demapper, LLR generation, turbo/convolutional decoder, CB CRC checking, TB reassembly, and TB CRC checking, recovering the transport blocks.

Verifying these individual blocks presents a number of challenges. First, the LTE standard is still changing and has not been ratified. Second, the size of the OFDM symbol (potentially 2,048 points) and the length of the subframe and turbo-decoder iterations required mean that a large number of simulation cycles are required to verify the behavior of even a single transport block being decoded. This can limit the test coverage achievable in simulation. Finally, just running unit tests on blocks does not test the macroscopic behavior of systems such as transport-block error rates, or validate that blocks have compatible interfaces.

To address these issues, the design team decided to implement an LTE reference system design that would provide system-level validation of the new LTE LogiCORE IP using real-world data sources such as video streams.

Since the main aim of the reference system was to validate new IP LogiCORE solutions, we wanted to minimize the amount of additional design work for the reference design. We also wanted to minimize the system integration and tool issues and use off-the-shelf boards and IP blocks as much as possible.

Previous reference systems for WCDMA Release 6 of the 3GPP specifications had used a separate DSP microprocessor board coupled with an FPGA board. This meant using different tool flows for software and hardware design, plus the overhead of designing interposer boards to connect the main boards together.

For the LTE baseband reference design, we chose to avoid these issues by gaining early access to the Xilinx Virtex-5 FXT device silicon. The FXT parts integrate on a single chip a PowerPC 440 microprocessor, DSP-enhanced FPGA fabric, and high-speed GTX off-chip communication blocks. This decision minimized our system integration problems (as all the key components were on a single chip) and allowed us to use the standard customer design board, the ML507.

In addition, the design simplified our tool flow. The top level of the system was integrated in Xilinx Platform Studio (XPS), which allows a large number of pre-verified blocks to be pulled into the system from the Xilinx Embedded Development Kit (EDK). XPS was the framework to pull in the various LogiCORE systems that needed validating, plus additional blocks that were implemented in a mixture of VHDL and Xilinx System Generator for the DSP portions of the design.

Implementing on the Virtex-5 FXT FPGA

Figure 3 shows a top-level block diagram of the LTE baseband reference design. The system shown consists of two ML507s, which act as IP packet bridges between the Gigabit Ethernet and LTE protocol stacks.

The reference application is video streaming using the open-source VideoLAN server. The VideoLAN server runs on the transmitter PC, which communicates to the system over the Gigabit Ethernet link. Video is received by the VideoLAN client on the receive PC. A control GUI also runs on the PC, allowing the setup of the LTE blocks to be changed and monitored remotely.

Video packets are transmitted through the LTE software driver and into the LTE downlink transmission blocks previously described. The output I/Q data from the LTE downlink is then sent out over an Aurora link. Aurora was chosen initially over dedicated base station standards such as CPRI/OBSAI protocols because of its early availability on FXT parts.

The I/Q data is received on the receive ML507 board and passed into the LTE downlink receive chain. Processing and communication follows an inverse order to that for the transmission, and finally delivers video packets to the VideoLAN client on the receiving PC.

Figure 3 is color-coded to highlight how the different features of the FXT parts are used. The software elements of the system that run on the PowerPC 440 are highlighted in dark blue. DSP functions comprising the LTE downlink transmit and receive processing are in orange and high-speed I/O blocks are in light blue. New or revised LogiCORE solutions are shown in yellow.

We found we could implement both transmit and receive functionality on a single FX70T part, so by looping back the data on the Aurora link, we could implement the system on a single ML507.

Figure 3 – Xilinx LTE baseband reference design showing system integration of hardware and software blocks. Blocks are stacked into respective protocols so that higher layers use services of lower-layer blocks. Peer-to-peer communication goes from left to right. The TX PC (VLC server, control GUI) connects over Gigabit Ethernet to the TX ML507 board, which connects over the Aurora link to the RX ML507 board, which in turn connects over Gigabit Ethernet to the RX PC (VLC client, control GUI). The key identifies PPC440 software, DSP blocks, high-speed I/O blocks, new or revised LogiCORE solutions, off-the-shelf blocks, and the reference design GUI.


Final Demonstration System

Figure 4 shows a screenshot of the final demonstration system in operation. The transmit video stream is shown along with the video stream received by two different LTE mobile users after the packets were transmitted over a noisy channel. The control GUI allows variable noise in the channel, along with modulation and encoding rate parameter settings for each user. The number of iterations used by the turbo decoder is also variable.

The demonstration allows us to examine the behavior of the LogiCORE solutions when placed in a system with real-world data. In the case of the video streams shown in Figure 4, we used the same modulation scheme (QPSK) for both users. However, for user 1, we started increasing the data encoding rate, thus reducing redundancy in the data.

As we pass an encoding rate of 0.8, we cross a threshold and start to see a significant number of errors in the decoded video stream at this value of SNR. These errors appear as artifacts in the decoded video stream. In a real base station, higher layers in the LTE protocol stack would retransmit extra redundant data for the packet. To maximize efficient use of the channel, the base station has to balance the cost of retransmission against the benefits of lower overall data encoding rates, and also attempt to minimize latency in the system. These may feed into quality of service (QoS) parameters used by higher layers in the LTE protocol.
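The trade-off between encoding rate and redundancy can be put into rough numbers. The sketch below estimates how many information bits fit in one 12 x 14 resource block for a given modulation and encoding rate; it assumes every resource element carries user data (ignoring control and synchronization overhead), and the names are illustrative only.

```python
import math

BITS_PER_SYMBOL = {"QPSK": 2, "QAM16": 4, "QAM64": 6}
ELEMENTS_PER_BLOCK = 12 * 14   # resource block size from Figure 1


def info_bits_per_block(modulation, code_rate):
    """Information bits carried by one resource block (overhead ignored)."""
    if not 0 < code_rate <= 1:
        raise ValueError("code rate must be in (0, 1]")
    coded_bits = ELEMENTS_PER_BLOCK * BITS_PER_SYMBOL[modulation]
    return math.floor(coded_bits * code_rate)


def redundancy_fraction(code_rate):
    """Share of the transmitted bits that is error-correction redundancy."""
    return 1 - code_rate
```

At an encoding rate of 0.8 only a fifth of the transmitted bits are redundancy, which is consistent with the decoder running out of margin on a noisy channel.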

Conclusion

The Xilinx Virtex-5 FXT device provides a tightly coupled integration of processor subsystem, DSP-enabled FPGA fabric, and high-speed communication. Such high levels of integration have allowed both the hardware and software elements of the LTE baseband reference system to be integrated on a single Xilinx FX70T part using standard hardware boards.

By using blocks available in the EDK toolkit to maximize IP reuse, and using Xilinx Platform Studio as a single integration framework, design teams can concentrate on the novel parts of the LTE downlink design. This has allowed rapid development and tracking of changes in the LTE specification as it approaches ratification.

Many thanks to other members of the LTE baseband design team: Beth Cowie, Graham Johnston, Andrew Laney, Neil Lilliott, Jorge Seoane, and Bill Wilkie.


Figure 4 – Screenshot of demo running. The GUI allows variable modulation, encoding rate, and channel noise. Transmitted video is in the upper right-hand corner. Received video (with time lag for software video decode buffers) is shown for two users in the bottom half of the figure. Uncorrected errors appear as artifacts in the video decode.

Get Published

Would you like to be published in Xcell Publications?

It's easier than you think!

Submit an article draft for our Web-based

or printed Xcell Publications and we will

assign an editor and a graphic artist

to work with you to make your work

look as good as possible.

For more information on this

exciting and highly rewarding program,

please contact:

Forrest Couch

Publisher, Xcell Publications

[email protected]


More gates, more speed, more versatility, and of course, less cost – it’s what you expect from The Dini Group. This new board features 16 Xilinx Virtex-5 LX 330s (-1 or -2 speed grades). With over 32 Million ASIC gates (not counting memories or multipliers) the DN9000K10 is the biggest, fastest, ASIC prototyping platform in production.

User-friendly features include:

• 9 clock networks, balanced and distributed to all FPGAs

• 6 DDR2 SODIMM modules with options for FLASH, SSRAM, QDR SSRAM, Mictor(s), DDR3, RLDRAM, and other memories

• USB and multiple RS 232 ports for user interface

• 1500 I/O pins for the most demanding expansion requirements

Software for board operation includes reference designs to get you up and running quickly. The board is available “off-the-shelf” with lead times of 2-3 weeks. For more gates and more speed, call The Dini Group and get your product to market faster.

www.dinigroup.com • 1010 Pearl Street, Suite 6 • La Jolla, CA 92037 • (858) 454-3419 • e-mail: [email protected]


by Mike Paczan [email protected]

Power Architecture processing technology is the common thread for a very broad range of computing devices, from 32-bit microcontrollers to 64-bit ASICs. It is also found in some of the most sophisticated FPGAs available today, including Xilinx® Virtex™-4 FX and Virtex-5 FXT FPGAs.

More than a billion Power Architecture-based chips have been built into electronic equipment since 1991, when the processing platform was first introduced as IBM’s POWER (Performance Optimized With Enhanced RISC).

Every Power Architecture-based chip is rooted in the Power Instruction Set Architecture (ISA), a processing specification spanning server and embedded computing capabilities and incorporating the AltiVec vector (SIMD) processing extension. The Power ISA serves as the basis for future advancement of this highly successful processing platform.

Although the Power ISA name may be new to some, its features are familiar to thousands of software, hardware, and tool developers who have worked with PowerPC devices over the past 16 years. Power Architecture technology underlies many well-known chip families, including full-featured Virtex FPGAs from Xilinx; the Cell Broadband Engine from IBM, Sony, and Toshiba; the PowerQUICC line of SOCs from Freescale; the POWER5 and POWER6 server chips from IBM; and, of course, hundreds of products from companies all over the world.

Silicon is just a small part of the complete technology platform: hundreds of companies provide Power Architecture tools, software, systems, and services to speed up and simplify product development. In any given product category – from custom design services to real-time operating systems – there are multiple providers from which to choose. The products and services these vendors offer are not only state-of-the-art but in many cases have also been refined by decades of experience in the marketplace.


Presenting Power Architecture Technology

The Power Architecture processing platform experiences rapid growth in unit shipments and software support.

VIEWPOINT

Expanding Market Opportunities

The Power Architecture community’s focus on improving customers’ systems design experience, along with our product variety and vendor quality, has led to tremendous growth in the technology platform. Between 2005 and 2006, unit shipments for Power Architecture processors grew by 61%.

Today, Power Architecture processing technology is used in a huge variety of advanced electronics. For example, the world’s two fastest supercomputers run Power Architecture processors, as do five of the world’s 10 best-performing enterprise servers. Power Architecture controllers are in more than half the new cars on the road. Practically every phone call, e-mail, and Web page is delivered by equipment using Power Architecture microprocessors. Power Architecture devices are also pervasive in consumer game consoles, including the Microsoft Xbox 360, Nintendo Wii, and Sony PlayStation 3.

Looking forward, the communications infrastructure, automotive control, aerospace, and defense market segments will continue to be strongholds for Power Architecture technology. The consumer market – digital televisions, DVD players, set-top boxes, and particularly game consoles – will drive strong double-digit growth for Power Architecture devices through 2011. Xilinx Virtex FPGAs with PowerPC cores will remain a vital part of the Power Architecture portfolio for many of these major markets.

Enhancing Architecture Through Power.org

Dozens of influential companies providing chips, tools, software, services, and systems have joined together in Power.org to develop open specifications and standards for the Power Architecture platform. To simplify the development experience for system designers, Power.org’s members will complete a number of projects and specifications this year:

• Searchable online product catalog covering almost all silicon, tools, software, and services offerings available for the Power Architecture technology platform

• Specifications establishing a common set of hardware debugging interfaces for Power Architecture implementations

• Publication of consolidated application binary interface specifications for 32- and 64-bit Power Architecture implementations with neutral licensing that permits wide use and reference by customers, third parties, and other interested parties to support the development of tools, operating systems, and other platform technologies

• Platform requirements specification for embedded systems built on Power Architecture technology

• An open-source Hypervisor optimized for embedded systems

• New models and simulation specifications that support the construction of “virtual prototypes” with Power Architecture processors

• Technical training for Power Architecture products offered online through webinars or in person at our regional Power Architecture conferences in Europe and Asia

Conclusion

Power.org’s open-community approach to managing Power Architecture technology has already resulted in many tangible benefits. For instance, in 2007, Power.org’s members completed the Server Power Architecture Platform Reference (sPAPR), an open specification for streamlining the development of Power Architecture products for Linux. This specification, built on more than 20 IBM patents, was made available to corporate members of Power.org royalty-free. Along with this specification, a compliant reference design was also released based on IBM’s 970MP processor. This was made available with a Linux stack, Hypervisor, and firmware.

By working together within Power.org, our member companies are continuing to improve the product design experience for our many shared customers and to expand business opportunities for the entire Power Architecture community.

For more information about Power.org, visit www.power.org.


FPGA Powered –

That‘s it!

FPGA-Module: TQM hydraXC – smallest, most universal

Hardware Platform for

Reconfigurable Computing

• Based on XILINX Spartan 3,

Virtex 4 and Virtex 5 technology

• Ethernet 10/100, USB 2.0, RTC

• SPI-, NAND-Flash, DDR2 / SDRAM

• Size 2.13 Inch x 1.73 Inch

(54 mm x 44 mm)

• Programmable VCC IOs

Embedded solution with

TQM hydraXC for

• Fast time to market

• Economical series production

• Highest fl exibility

• Hardware reduction

:: Starter kits available ::

Email: [email protected]

Infoline: (+49 ) 81 53 / 93 08 - 333

Starter kit

with module

VIEWPOINT

EMBEDDED APPLICATIONS

Developing FPGA Coprocessors for Performance-Accelerated Spacecraft Image Processing

C-to-FPGA design techniques speed development of space-based imaging applications.

by Paula J. Pingree, Senior Engineer, Jet Propulsion Laboratory, California Institute of Technology, [email protected]

Lucas J. Scharenbroich, Staff Engineer, Jet Propulsion Laboratory, California Institute of Technology, [email protected]

Thomas A. Werne, Associate Engineer, Jet Propulsion Laboratory, California Institute of Technology, [email protected]

David Pellerin, CTO, Impulse Accelerated Technologies, [email protected]

Fast and accurate on-board classification of image data is a critical part of modern satellite image processing. For Earth sciences and other applications, space-based smart payloads make use of intelligent, machine-learning algorithms and instrument autonomy to detect and identify natural phenomena such as flooding, volcanic eruptions, and sea ice break-up.

Copyright 2008 California Institute of Technology. Government sponsorship acknowledged.


The Jet Propulsion Laboratory (JPL), a National Aeronautics and Space Administration (NASA) laboratory, has developed support vector machine (SVM) classification algorithms used on board spacecraft to identify high-priority image data for downlinking to Earth. These algorithms also provide onboard data analysis to enable rapid reaction to dynamic events (Figure 1). These onboard classifiers help reduce the amount of data downloaded to Earth, greatly increasing the science return of the instrument.

SVM classification algorithms are flying today, using computational platforms such as the RAD6000 and Mongoose V processors. These legacy processors have limited computing power and extremely limited active storage, and they are no longer considered state-of-the-art. For this reason, onboard classification has been limited to only the simplest functions running on only a subset of the full instrument data: for example, only 11 of 242 bands in the case of the Hyperion instrument on the Earth Observing-1 (EO-1) satellite.

FPGA coprocessors are an ideal candidate for these algorithms. FPGAs can provide significant improvement in onboard classification capability and accuracy when compared to the legacy processing platforms now flying.

To evaluate the effectiveness of FPGAs for SVM algorithms, we implemented a legacy snow-water-ice-land (SWIL) classifier, originally developed for the Hyperion instrument, on the Xilinx® Virtex™-4 FX60 FPGA. To develop the application more quickly, we took advantage of the Impulse C-to-FPGA compiler tools provided by Impulse Accelerated Technologies. These tools support the rapid development of highly parallel hardware algorithms and applications.

This article describes our approach to implementing the Hyperion linear SVM on the Virtex-4 FX60 FPGA, as well as additional experiments that we performed using an increased number of data bands and a more sophisticated SVM kernel. These experiments show the potential for more efficient, higher performance onboard classification using FPGA-embedded algorithms.

FPGAs for Onboard Computation

Onboard computation has become a significant bottleneck for advanced, space-based scientific and engineering applications. Currently available spacecraft processors have high power consumption, are expensive, require additional interface boards, and are limited in their computational capabilities.

Recently developed hybrid FPGAs, such as the Xilinx Virtex-4 FX device, offer the versatility of running diverse software applications on embedded processors while at the same time taking advantage of reconfigurable hardware resources, all on a single chip. These tightly coupled, single-chip hardware/software systems offer lower power and lower cost than general-purpose, single-board computers (SBCs). FPGA platforms offer breakthrough performance over radiation-hardened SBCs, leading to entirely new architectures for smart payloads.

Figure 1 – NASA uses smart payloads to classify Earth image data and reduce the amount of data required for downloading (Image courtesy of NASA).

To evaluate the potential for acceleration using FPGAs, we selected the Xilinx ML410 evaluation platform for development and demonstration of selected smart payload concepts. The Xilinx ML410 evaluation board comes equipped with a Virtex-4 FX60 FPGA that features two embedded PowerPC 405 processors and a large amount of available FPGA logic.

The specific algorithm we chose for our investigations is the SVM classification algorithm. Algorithms of this type have found broad application in general machine learning and classification tasks, as well as for onboard remote sensing. An SVM is a maximum margin classifier that calculates a separating hyperplane between two labeled classes such that the distance to the nearest data in each class is maximized (Figure 2). By selecting such a maximum margin hyperplane, the SVM classifier can exhibit better generalization to new data than other linear classification methods.

Figure 2 – Calculating a separating hyperplane using an SVM. The circled data points are the support vectors that lie on the margin.

The goal of training a support vector machine is to learn a set of weights such that the sign of a weighted sum of dot products between the training data, xi, and a test vector, t, will correctly predict the class of the new data vector.

SVMs also incorporate what is known as the kernel trick, a method allowing them to be extended from purely linear to non-linear classifiers. This method involves formulating the training and testing algorithms in terms of dot products and then replacing the dot products with a kernel function that represents a dot product after passing the arguments through some non-linear function. The kernel function permits the high-, or even infinite-dimensional, dot products in the non-linear space to be computed using terms from the original, low-dimensional space.

SVMs are well suited to onboard autonomy applications. The property that makes SVMs particularly applicable is the asymmetry of computational effort in the training and testing stages of the algorithm. Classifying new data points requires orders-of-magnitude less computation than training because the process of training an SVM requires solving a quadratic optimization problem.

SVM training requires O(n^3) operations, where n is the number of training examples. In contrast, testing a new vector with a trained SVM requires only O(n) operations. Faster training algorithms that exploit the specific structure of the SVM optimization problem exist, but the training remains the primary computational bottleneck.

After training the SVM, many of the weights, wi, will be equal to zero. This means that these terms can be ignored in the classification formula. Input vectors that have a corresponding non-zero weight are called support vectors. Reducing the number of support vectors is key to successfully deploying an SVM classifier on board a spacecraft, where there are severe constraints on the amount of CPU resources available.
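The classification step described above can be illustrated with a minimal Python sketch. It is not the flight code or the Impulse C source: the function names, the bias term, and the polynomial kernel parameters (`degree`, `offset`) are assumptions made for illustration, and the weights are assumed to already include the sign of each training label.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def linear_kernel(x: Vector, t: Vector) -> float:
    """Plain dot product between a training vector and a test vector."""
    return sum(a * b for a, b in zip(x, t))


def polynomial_kernel(x: Vector, t: Vector, degree: int = 2, offset: float = 1.0) -> float:
    """Kernel trick: a dot product in a higher-dimensional space,
    computed from the dot product in the original space.
    degree and offset are illustrative values only."""
    return (linear_kernel(x, t) + offset) ** degree


def classify(
    t: Vector,
    training_vectors: Sequence[Vector],
    weights: Sequence[float],
    bias: float = 0.0,  # assumed term; the article does not mention a bias
    kernel: Callable[[Vector, Vector], float] = linear_kernel,
) -> int:
    """Return +1 or -1 from the sign of the weighted sum of kernel values."""
    total = bias
    for x_i, w_i in zip(training_vectors, weights):
        if w_i == 0.0:
            continue  # zero weight: not a support vector, so skip the term
        total += w_i * kernel(x_i, t)
    return 1 if total >= 0 else -1
```

Because only the support vectors contribute, the cost of one classification grows with the number of non-zero weights, which is why reducing the support set matters on a constrained processor.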

Previously deployed classifiers have used such reduced-set methods, but were still constrained to operate on only a subset of the available classification features. Removing such bottlenecks is critical to realizing the full potential of SVMs as an onboard autonomy tool.

Partitioning the Problem

When using FPGAs with embedded processors, efficient partitioning of algorithms between software and hardware is important to achieve high performance. For the FPGA-based development of the SVM, we implemented a previously software-only legacy algorithm in the FPGA hardware fabric to take advantage of the FPGA’s high-speed parallel processing capabilities. The image file input and classification file output are managed within the embedded PowerPC processor using the CompactFlash card provided on the ML410 board.

In this implementation, the software side of the application is coded in C and compiled to the embedded PowerPC 405 processor using the Xilinx EDK tools. The embedded software application reads an input image file consisting of 857,856 pixels. The image file is read from the CompactFlash card installed in the ML410 board.

The software-side application streams the image data to the SVM, which is also written in C but has been compiled (using the Impulse C-to-FPGA compiler) to FPGA hardware. The SVM hardware process performs the required SVM operation on the image and streams the results back to the PowerPC 405 processor. The processor then writes the pixel classifications (e.g., snow, water, ice, land, cloud, or unclassified) to an output file on the CompactFlash card (Figure 3).

The PowerPC ran the software portion of the task, which sends data to and collects data from the SVM hardware module. We chose to use the PowerPC instead of a MicroBlaze™ processor because the PowerPC can operate at triple the clock frequency of the MicroBlaze processor. Also, the MicroBlaze processor would be instantiated in valuable FPGA fabric, whereas the PowerPC exists external to the fabric.

Because the 256-MB DIMM is the largest source of memory on the board, we used it as main memory for the program. The processor local bus (PLB) is a high-speed bus (compared to the on-chip peripheral bus [OPB]) that allows for fast data transfer to/from the memory and SVM core peripherals. The 16-GB CompactFlash card holds the input and output data files, which are too large to fit on the DIMM. The UART was used for debugging output. The OPB is a low-speed bus that serves as the default interface between the processor and the SystemACE™ interface controller and UART peripherals.

Figure 3 – Software/hardware partitioning for the SVM algorithm (diagram blocks: Input Image File, Data Input/Output, SVM, Output Image File)

In support of partitioned software/hardware applications such as this, the Impulse tools include a library of C-compatible functions that implement a number of process-to-process communication methods. These methods include streaming, shared memory, and message passing. For this application, the Impulse C streaming programming model was the obvious choice.

In Impulse C streaming applications, hardware and software processes communicate primarily through buffered data streams implemented directly in hardware. This buffering of data makes it possible to write parallel applications at a relatively high level of abstraction without the cycle-by-cycle synchronization that would otherwise be required.

Figure 4 illustrates the design flow for C-to-hardware compilation using the Xilinx FX60 FPGA as a target.

On the software side of the application (in this case on the PowerPC 405 processor used for hardware-level testing), Impulse C functions are used to open and close data streams, read or write data on the streams, and, if desired, send status messages or poll for results. In the case of the Virtex-4 FX, stream reads and writes can be specified as operations that take advantage of either the PLB or the auxiliary peripheral unit (APU) interface.
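The streaming model described above can be approximated in software. The following Python sketch is an analogy only: it does not use the Impulse C library, and every name in it (`run_pipeline`, `hardware_process`, `END_OF_STREAM`, the `depth` parameter) is hypothetical. Bounded queues stand in for the buffered hardware streams, and a sentinel object stands in for closing a stream.

```python
import queue
import threading

END_OF_STREAM = object()  # sentinel standing in for a stream-close event


def hardware_process(pixels_in, classes_out, classify):
    """Stand-in for the FPGA process: read a pixel, classify it, write the result."""
    while True:
        pixel = pixels_in.get()
        if pixel is END_OF_STREAM:
            classes_out.put(END_OF_STREAM)
            return
        classes_out.put(classify(pixel))


def run_pipeline(image, classify, depth=16):
    """Stream pixels through the classifier using two bounded buffers."""
    pixels_in = queue.Queue(maxsize=depth)    # software -> hardware stream
    classes_out = queue.Queue(maxsize=depth)  # hardware -> software stream

    def producer():
        for pixel in image:
            pixels_in.put(pixel)  # blocks while the buffer is full
        pixels_in.put(END_OF_STREAM)

    threads = [
        threading.Thread(target=producer),
        threading.Thread(target=hardware_process,
                         args=(pixels_in, classes_out, classify)),
    ]
    for thread in threads:
        thread.start()

    results = []
    while True:
        item = classes_out.get()
        if item is END_OF_STREAM:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results
```

The writer and the reader run concurrently so that neither bounded buffer can fill up and stall the pipeline, which mirrors how buffering removes the need for cycle-by-cycle synchronization between the two sides.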

Generating Parallel FPGA Hardware

To create the hardware portion of our application, we used the Impulse C compiler to generate synthesizable HDL files ready to use with the Xilinx EDK tools. In addition to generating HDL files, the Impulse compiler also generates additional files required by the EDK tools, including the needed PLB and APU bus interfaces. The Impulse C compiler performs a variety of low-level optimizations, including C statement scheduling and loop pipelining, saving application developers a great deal of time that would otherwise be spent performing tedious, low-level hardware optimization.

The Impulse C compiler performs these optimizations and generates hardware in the form of either VHDL or Verilog. This hardware can then be synthesized using FPGA tools such as Xilinx ISE™ software and Platform Studio. On the processor side, the compiler generates run-time libraries ready for use on the embedded PowerPC processor.

Validating the Algorithm

The output of the combined software/hardware application is a file comprising a column of integers indicating the resulting class of each pixel in the image. To validate and visualize the results of the experiment, we used MATLAB to reformat the column of integer values to the original pixel-wise dimensions of the image. Each class was assigned an arbitrary color, and the number of pixels belonging to each class was tabulated. We could then easily calculate the percentage of pixels belonging to each class as well as visualize the resulting file of classified pixels as a colored image.
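The reshaping and tabulation steps can be sketched as follows. The authors used MATLAB; this Python version is only an illustration, the function names are hypothetical, and row-major pixel order is an assumption because the article does not state how the column is ordered.

```python
from collections import Counter


def reshape_classes(column, width):
    """Turn a flat column of class labels into rows of `width` pixels.
    Row-major ordering is assumed."""
    if width <= 0 or len(column) % width != 0:
        raise ValueError("column length must be a multiple of the image width")
    return [column[i:i + width] for i in range(0, len(column), width)]


def class_percentages(column):
    """Percentage of pixels assigned to each class label."""
    counts = Counter(column)
    total = len(column)
    return {label: 100.0 * n / total for label, n in counts.items()}
```

For example, `class_percentages([0, 0, 1, 2])` reports 50% for class 0 and 25% each for classes 1 and 2.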

This validation was required to meet two goals of this project. First, it was necessary to validate both the pixel classification results from the legacy (software-only) version of the SVM and the generated hardware implementation of the SVM. This was necessary to verify that the integer and floating-point calculations performed in the FPGA (in hardware) returned the same results (within acceptable margins) as those observed in software currently flying.

Figure 4 – Design flow from C-code to FPGA-embedded application (diagram blocks: C Language Applications; Generate FPGA Hardware; Generate Hardware Interfaces; Generate Software Interfaces; HDL Files; C Software Libraries)

We began the validation process by comparing the pixel classification percentage results to those reported by the SVM used in the currently flying EO-1 satellite. The classification percentages show good agreement, particularly for the snow and water classes.

Image visualizations were also important in this effort. Our resulting visualizations show excellent agreement with the results from the EO-1 (Figure 5).

In addition to the qualitative comparison of the images, we also conducted a pixel-by-pixel comparison of the legacy algorithm results and our classifications. The pixel-by-pixel classification comparison showed that 76.8% of the pixel classifications in our results matched those of the Autonomous Sciencecraft Experiment (ASE). We believed that the remaining discrepancies were caused by the differences in the training data sets of the SVMs, but nonetheless we decided to verify the hardware implementation on a pixel-by-pixel basis.

To completely validate the algorithm, we compared the outputs of a software implementation, using the same input data, to the version implemented in hardware by the Impulse C compiler. The two implementations produce identical classifications on a pixel-by-pixel basis. The combination of the good agreement of our results with the legacy ASE results, as well as the independent results from the software platform, led us to believe that our implementation was valid. All of these validations were performed using a combination of C language and MATLAB programming methods, with no need to use hardware design methods or hardware description languages.
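A pixel-by-pixel comparison of this kind reduces to counting matching labels. The sketch below is a hypothetical Python illustration, not the authors' C or MATLAB code; the function names are assumptions.

```python
def pixel_agreement(reference, candidate):
    """Percentage of pixels whose class labels match in both outputs."""
    if len(reference) != len(candidate):
        raise ValueError("both classifications must cover the same pixels")
    if not reference:
        raise ValueError("no pixels to compare")
    matches = sum(1 for r, c in zip(reference, candidate) if r == c)
    return 100.0 * matches / len(reference)


def mismatched_pixels(reference, candidate):
    """Indices of pixels that were classified differently."""
    return [i for i, (r, c) in enumerate(zip(reference, candidate)) if r != c]
```

An agreement of 100% with an empty mismatch list corresponds to the identical software and hardware outputs reported above, while a lower figure, such as the comparison against the legacy ASE results, points to the specific pixels worth inspecting.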

Extending the Algorithm

Having successfully implemented the legacy SVM designed for Hyperion, we then considered two extensions to the C-language algorithm: using a larger number of bands with the same linear kernel SVM and creating a new SVM with a non-linear kernel. For the expanded linear kernel SVM, we arbitrarily selected 30 of the available 242 bands in the image. For the non-linear kernel SVM, we used the same 11 bands as the legacy SVM, but with a modified kernel. Because training data was not available for the original legacy SVM, we could not generate new SVMs that would be comparable to it, so we used new training data to generate the two new SVMs. We also generated a new, 11-band linear-kernel SVM for comparison to the legacy SVM.

Using the C-to-hardware compiler, we were able to quickly experiment with these alternative implementations and compare results, both in terms of accuracy and in terms of performance. The hardware implementation of these SVMs produced results that agree very well with the software simulations of the algorithms and demonstrate significant increases in overall system performance.

We were also able to determine (using Xilinx synthesis tools) the FPGA fabric utilization percentages for each of these SVMs, as shown in Table 1.

Conclusion

FPGAs with embedded processors are demonstrating levels of performance and efficiency that were previously impossible using traditional processors. Hardware acceleration of SVM algorithms promises to dramatically improve onboard data processing in future science missions.

By using embedded FPGAs such as the Virtex-4 FX60 device, we can implement increasingly advanced SVMs with “room to grow” in onboard resources. Our results demonstrate that we can achieve even our most advanced SVM extension, a polynomial kernel, using just one-third of the DSP blocks and slices available in the Virtex-4 FX60 FPGA.

Software-to-hardware design tools played an important role in the fast prototyping and development of these SVM algorithms. The Impulse C tools allowed us to more easily manage the complexity of the application and to experiment with alternative implementations. For testing and algorithm validation, the Xilinx ML410 board provided an excellent and cost-effective development target.

[The color key is blue = water, cyan = ice, dark purple = snow, lavender = unclassified]

Figure 5 – A comparison of the results from a) the legacy 11-band SVM implementation, b) the FPGA-accelerated implementation, and c) the original hyperspectral image.

FPGA Resources     Total on V4FX60   Linear (11 bands)   Linear (30 bands)   (2,1) Polynomial
Slices             25,280            1,151 (4%)          2,253 (8%)          2,082 (8%)
Slice Flip-Flops   50,560            1,290 (2%)          1,337 (2%)          2,519 (4%)
Four-Input LUTs    50,560            1,838 (3%)          3,110 (6%)          3,287 (6%)
FIFO16/RAMB16s     232               2 (1%)              2 (1%)              5 (2%)
DSP48s             128               4 (3%)              4 (3%)              12 (9%)

Table 1 – Device utilization for an FPGA-based SVM algorithm

Using the MicroBlaze Processor to Develop Speed Sensors for F1 Racing Cars

CEA Leti Minatec, Michelin, and EASii IC developed an accurate and real-time optical sensor prototype that calculates both speed and slip angle.

by Viviane Cattin, Research Engineer, CEA Grenoble Leti Minatec, [email protected]

Bernard Guilhamat, Research Engineer, CEA Grenoble Leti Minatec, [email protected]

Angélo Guiga, Research Engineer, CEA Grenoble Leti Minatec, [email protected]

Sébastien Riccardi, Hardware Activity Manager, EASii IC, [email protected]

Understanding the dynamic behavior of cars, in particular two-dimensional horizontal speed parameters (also known as the car’s slip angle), is critical to optimize performance during races and improve essential equipment like tires.

You can estimate horizontal speed parameters through indirect modeling of wheel speed and yaw-angle measurements; however, this method suffers from modeling errors and other uncertainties. Some optical sensors give measurements that are, in practice, not reliable in wet conditions.

We set out to develop a compact and real-time sensor that could deliver a direct slip-angle measurement with high accuracy and flexibility in various conditions (dry, wet, humid, or snowy). In this article, we’ll describe how the Xilinx® Virtex™ FPGA family, with its embedded MicroBlaze™ processors, helped us achieve our goal.

Sensor Measurement Principles

A laser velocimeter is commonly used for taking speed measurements in the printing and textile industries. The velocimeter takes laser images of the surface and uses an autocorrelation function to compute accurate speed estimates.

We extended this approach to measure both longitudinal and transversal speed components. For our sensor, two laser diodes emit elliptical beam shapes to illuminate the road. The first (front) laser beam runs parallel to the longitudinal axis of the car, while the second (back) laser beam measures orthogonally. Two linear photodiode arrays collect the light beams reflected from the ground.

A pixel from the front photodiode array is compared to all pixels on the back photodiode array. The back pixels provide information about slip angle or transversal speed, while a corresponding time delay gives longitudinal speed estimation. Figure 1 illustrates the general optical configuration of the sensor.

We then added a further function to our sensor. Because of the laser beam shape, we inserted a standard triangulation process to measure ground height variation beneath the car. This parameter is useful for understanding the behavior of the vehicle and also improves overall system accuracy.

Finally, we developed an innovative process to avoid optical disturbances. In wet conditions, random spurious reflections disturb systems locally and corrupt the calculation of correlations. Our solution is two-fold: inclining the laser beams at a specific angle and using preprocessing to split the optical signals. We successfully tested our sensor in real time in various conditions.

Embedded Intelligence

During the development of our sensor, we designed and validated the signal processing algorithms using a reconfigurable FPGA. Unlike other targets such as DSPs or ASICs, an FPGA allowed us to adapt routines and software architecture depending on test results and evolving specifications.

Software Architecture

Figure 2 shows the main architecture of the sensor as proposed by Leti, the research laboratory in charge of this project. Three processes run concurrently:

• The height process calculates ground height variations.

• The speed process computes longitudinal speed and slip angle from correlations and height estimations of the car.

• The output process sends sequential outputs on a high-speed, CAN-standard automotive bus system.

These processes share variables through memory access.
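The correlation step at the heart of the speed process can be sketched in Python under strong simplifying assumptions. This is not the FPGA implementation: the function names, the geometry parameters (`beam_spacing_m`, `pixel_pitch_m`, `reference_pixel`), and the exhaustive search over integer lags are all hypothetical, and the real design adds filtering, chopping, interpolation, and oversampling. All signals are assumed to have the same length.

```python
import math


def cross_correlation(front, back, lag):
    """Sum of front[k] * back[k + lag] over the overlapping samples."""
    return sum(front[k] * back[k + lag] for k in range(len(front) - lag))


def estimate_speed_and_slip(front, back_pixels, sample_rate_hz,
                            beam_spacing_m, pixel_pitch_m,
                            reference_pixel, max_lag):
    """Find the back pixel and time lag that best match the front pixel.

    The lag gives the longitudinal speed; the matching pixel's offset
    from reference_pixel gives the transversal speed and slip angle.
    """
    best = None
    for pixel_index, back in enumerate(back_pixels):
        for lag in range(1, max_lag + 1):
            score = cross_correlation(front, back, lag)
            if best is None or score > best[0]:
                best = (score, pixel_index, lag)
    _, pixel_index, lag = best
    delay_s = lag / sample_rate_hz
    v_long = beam_spacing_m / delay_s
    v_trans = (pixel_index - reference_pixel) * pixel_pitch_m / delay_s
    slip_deg = math.degrees(math.atan2(v_trans, v_long))
    return v_long, v_trans, slip_deg
```

In this simplified picture, a ground feature seen by the front pixel reappears on one of the back pixels after a delay; the delay fixes the longitudinal speed and the lateral offset of the matching pixel fixes the slip angle.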

Figure 2 includes some key points marked with circles. As in the example previously mentioned, signal chopping improved our ability to take measurements in wet conditions. Still other developments are essential to provide accurate measurements:

• Sampling frequency is the constant parameter that ensures accurate measurements. We controlled the acquisition frequency of diodes so that the ground sampling remained constant during vehicle displacement. A control loop adjusted the sampling frequency according to the speed estimation and prediction, part of the speed process previously described.

Figure 1 – General optical configuration of a laser speed sensor. Three photodiode arrays measure longitudinal speed, height, and transversal speed. (Diagram labels: lasers 1 and 2, 685 nm, 50 mW, Class 3B; Vtrans, Vlongi, V, H)

Figure 2 – Main software architecture with key points. (Diagram blocks: HEIGHT process with height pixels, adaptive threshold, barycentre calculation, optional average, and height H estimation; SPEED process with front and back pixels, average subtraction, FIR filter, adaptive chopping, 1D correlation, maximum extraction, interpolation, fit of delay and transversal pixel, oversampling, sorting average, error management, estimation of speed Si and slip angle Di, Fs loop control, and model updating; OUTPUT process with prediction of Sout and Dout, CAN frame formation, and a CAN bus frame carrying S, D, H, and control flags)

• Oversampling of ground signals improved measurement accuracy, particularly for the high speeds (above 190 kilometers per hour) of F1 racing cars. At these frequencies, we encountered a technological ceiling of photodiode array integration time. Therefore, we averaged several slightly shifted measurements to improve accuracy (similar to when using very fast oscilloscopes).

This routine additionally offered a critical way to evaluate the quality of sensor measurements. If all calculations matched, we could be confident in the accuracy of the measurement.

• Taking into account height estimation when calculating speed improves measurement quality; we used the car’s ground height variation to refine optical parameters and added a suitable threshold to the signal to avoid sun perturbations on height calculation.
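The idea of keeping the ground sampling constant, from the first key point above, can be illustrated with a short Python sketch. The relation used here (sampling frequency equals speed divided by the desired ground step) follows from the description; the function names, the linear speed prediction, and the frequency limits are assumptions for illustration and do not come from the sensor firmware.

```python
def predict_speed(previous_m_s, current_m_s):
    """Hypothetical predictor: linear extrapolation of the last two estimates."""
    return current_m_s + (current_m_s - previous_m_s)


def sampling_frequency_hz(predicted_speed_m_s, ground_step_m, f_min_hz, f_max_hz):
    """Sampling frequency that keeps one sample per ground_step_m of travel.

    The result is clamped to the range the acquisition hardware supports;
    the upper limit plays the role of the integration-time ceiling
    mentioned in the text.
    """
    if ground_step_m <= 0:
        raise ValueError("ground_step_m must be positive")
    ideal = abs(predicted_speed_m_s) / ground_step_m
    return min(max(ideal, f_min_hz), f_max_hz)
```

Once the ideal frequency exceeds the upper limit, the ground step can no longer be held constant, which is the regime where averaging several slightly shifted measurements becomes useful.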

Other smaller processes run in parallel. For example, a specific algorithm evaluates whether speed is equal to zero and communicates that the car is stopped.

Hardware Architecture

We wanted a programmable embedded target that offered the possibility to accelerate software run times, used reconfigurable parameters, and facilitated evolutions of the algorithm. An FPGA had all of these features, as well as offering the potential to parallelize correlation calculations.

We used two FPGAs from the Virtex device family: one devoted to preprocessing (sensor control, digital-to-analog converter, memory, filters, averages, height processes) and one focused on the main processing.

We added a MicroBlaze embedded processor inside the second FPGA to implement our high-level algorithm. This algorithm consists mainly of conditional instructions, with few mathematical calculations. Using this processor allowed us to apply mathematical functions (such as trigonometric functions). The soft-core processor is also more user-friendly (based on C language) and facilitates debugging.

Figure 3 shows the architecture as proposed by EASii IC, the company in charge of hardware realization.

Component Choice

The sensor must operate through the severe conditions of F1 racing cars, which include low weights, reduced size, high vibrations, and high accelerations. We used both DC/DC converters and low dropout regulators to avoid major perturbations from the car’s power supply.

The output is transmitted over a high-speed automotive CAN bus, thanks to an SJA1000 chip manufactured by NXP Semiconductors. To improve the heat transfer of the most consuming components (such as the analog-to-digital converter), we added a thermal paste, which facilitated thermal exchanges with the metal packaging. Additionally, a thermal sensor gives the temperature of the sensor to prevent overheating.

Figure 3 – Main software architecture with four analog inputs, digital-to-analog converter timing, memory storage, calculations, timing, and output control (diagram blocks: two XC2V1000 FPGAs for preprocessing and processing; FL, LS, and HS sensors with analog channels and 30 Hz to 30 kHz filters; sensor control; sequencers; clock managers; averages; buffers and maximum blocks; ZBT control with 256K x 16-bit ZBT memory; INTERCORR; 32-bit RISC MicroBlaze processor; CAN controller)

To avoid connection failure and stress breakdowns, we added glue to the largest electronic modules. Most of the basic components of our system are off-the-shelf, which means that further optimization would be possible if we applied our system to a specific design. The total power consumption of the sensor is ~10W. The total size (195 mm x 107 mm x 35 mm, including both electronic and optical elements) facilitates rapid mounting on the car, either on the keel for racing cars or anywhere on the chassis.

Performance

We set up and calibrated our prototype at the Michelin Technology Centre, our partner in this project and the eventual end user of the sensor. They have a dedicated tool for simulating high-speed and slip-angle variations. The sensor is designed to work with velocities from 2 to 400 kilometers per hour and can calculate slip angles from -10° to +10°. We reported errors of 0.5% on speed, 0.3 mm on ground height variation, and less than 0.1° on slip angle.

We have conducted more than 50 trials since 2003 at the Michelin Technology Centre and with F1 partner teams. These measurements allowed us to significantly improve sensor performance and validate its use under a wide variety of conditions.

Figure 5 presents experimental results in wet conditions. The outputs of the other commercial speed sensor are unusable because it was strongly disturbed by the behavior of water on asphalt. Our prototype is more robust; F1 car experts consider these results fully functional.

Conclusion

Various experiments have shown that our prototype addresses the accuracy and robustness necessary for F1 race cars. We believe that these results show the engineering maturity of our sensor, especially with its innovative ability to operate on wet surfaces – a key challenge for non-contact speed sensors.

The simplicity of the global architecture (which includes electronic devices as well as optical parts) makes us confident that our optical speed sensor prototype could be implemented at a low cost. We have already studied a new hardware design that fully exploits the available space between optical elements, reducing package volume by a third.

Future work may include an image sensor to simplify the optical part and fusion with other data (such as accelerometer measurements) to enhance sensor functionalities.

For more information, visit Leti (www-leti.cea.fr) or EASii IC (www.easii-ic.com) or e-mail [email protected].

We wish to acknowledge Sébastien Dauvé, Bruno Flament, William Foucault, Jean-Michel Ittel, Philipe Peltié, and Jean-Charles Vidal from Leti research laboratories and Jean-Baptiste Monnard from EASii IC for their contributions to this project.

Figure 4 – Sensor photography and main parameters

Figure 5 – Comparative result on wet surfaces (LETI sensor in blue, other commercial sensor in red) showing speed (top), slip angle (middle), and height output (bottom)

FPGA-Based Controller Enables Precise Chemical Analysis

An embedded controller using a Spartan-3 device and MicroBlaze processor provides electronic control of gas flows and pressures.

by Robert Henderson
Development Engineer
Agilent [email protected]

A gas chromatograph (GC) is a chemical analysis instrument that separates volatile substances in a complex chemical sample. GCs help answer such basic questions as:

• “What contaminants are in this ground water?”

• “Is this gasoline formulated correctly?”

• “Are there any residual solvents in this pill?”

The basic mechanism for gas chromatographic separation is the distribution of a sample between two phases (a stationary phase and a gas mobile phase). In modern chromatography, the stationary phase is coated on the inside of a long narrow tube known as a column.

As the sample passes down the column, the different chemical components travel at different rates and are detected at the exit of the column by a detector. In addition to the gas supporting the flow of the sample down the column, each detector also uses between one and three supporting gases (see Figure 1).

Agilent Technologies (spun off from Hewlett Packard in 1999) has been a part of the GC business since 1965. In 1973, Hewlett Packard introduced the world’s first microprocessor-controlled analytical instrument, the 5830 GC, and over the years introduced a series of next-generation instruments, including the Agilent 7890A high-performance GC in 2007.


GCs have evolved measurably over the last 20 years, from simple mechanical pressure and flow regulators to sophisticated electronic closed-loop pressure and flow controllers that allow for the storage and retrieval of the entire instrument setup. This capability also allows for programmable setpoints and for the logging of any deviation from the setpoint(s).

Spartan-3 FPGA-Based Pneumatic Modules

Because of the disparate application needs of our customers, we designed the 7890A GC with a modular structure, allowing as many as six pneumatic modules to communicate with a common “base unit” that provides the central CPU and memory, communication ports, keyboard and display interfaces, and power supplies.

Pneumatic modules can regulate up to three channels of gas for each inlet or detector. Each module is essentially an embedded controller that receives commands from the CPU in the base unit, but otherwise operates independently. The module controls the three independent pneumatic channels in a closed-loop manner, offloading the CPU in the base unit from this real-time task (Figure 2).

Four-Wire Interface (Module to Base Unit)

From the base unit to the module, the entire interface comprises an unshielded four-wire cable: two wires are for power and two are for communications. The power supplied is an unregulated, full-wave-rectified +24V, and the communications path is a bus LVDS system.

The LVDS communications path is a differential signal path that minimizes noise susceptibility issues and allows for cable lengths long enough to place modules anywhere inside the instrument, maximizing instrument configurability options. The bus LVDS communications operate in a half-duplex mode, allowing the same pair of wires that sends commands to the module to provide command responses from the module.

Additionally, because the communications interface in the base unit is also implemented with an FPGA, the LVDS communications link between the two FPGAs required essentially no hardware, saving cost, reducing complexity, and improving reliability.

MicroBlaze Embedded Processor Core

In order to implement the embedded controller, we configured the Xilinx® MicroBlaze™ embedded processor in the FPGA using the Platform Studio tool provided in the EDK development system. This tool not only defines, synthesizes, and routes user logic, utilizing MicroBlaze processor IP and associated bus interface designs, but also develops and compiles the program code for the MicroBlaze processor. All FPGA hardware description is in VHDL, and all MicroBlaze processor coding is in C.

Specifying the optional barrel shifter and hardware divider in the microprocessor hardware specification file (system.mhs) enabled the processor system to function as a real-time controller. Performing operations such as bit shifting or division in software rather than in dedicated hardware takes too long for some of the time-critical PID controller processes.

Figure 1 – Block diagram of gas flows in a gas chromatograph (blocks shown: inlet with carrier gas, sample, separation column, detector, and detector support gases)

Figure 2 – Block diagram of Spartan-3 FPGA-based pneumatic module (blocks shown: multiplexed A/D with eight input channels, Spartan-3 FPGA (XC3S200), EEPROM, valve drivers, valve driver monitor, module power supplies, two pressure sensors, one flow sensor, three proportional valves, and a 4-pin connector carrying +24V and the 2-wire LVDS link to the base unit CPU)

Using the Xilinx-supplied template VHDL file, the user logic interfaces to the MicroBlaze processor OPB bus with an OPB to IPIF (IP interface) block. This block takes care of the OPB bus protocol and interface signals and presents a simplified interface to the user logic called the IP InterConnect (IPIC).

The major user logic VHDL blocks complement the external hardware controlled by the MicroBlaze processor, resulting in the overall FPGA system design (Figure 3). Let’s review the major components.

Multiplexed A/D Converter Control

A common delta-sigma analog-to-digital converter (ADC) is multiplexed between eight inputs. With this type of ADC, when the input changes, the output filter must be “flushed” of the previous channel’s data. A VHDL block manages the ADC and its serial data output.

When the MicroBlaze processor selects a specific channel, the VHDL subsystem reads and throws away the old channel’s readings, then reads and averages data automatically from the ADC. The MicroBlaze processor specifies to the VHDL block how many readings to average, and thus can perform other operations (such as executing a new command from the base unit CPU) until an averaged reading is ready.
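The discard-then-average behavior can be modeled in C for illustration. This is only a host-side sketch of the idea, not the VHDL block used in the module: the names `adc_average`, `adc_read_fn`, `replay_t`, and `replay_read`, and the choice of how many stale readings to discard, are assumptions introduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback that returns the next raw conversion result. */
typedef int32_t (*adc_read_fn)(void *ctx);

/*
 * Model of the averaging block: drop `discard` stale conversions left in the
 * delta-sigma filter after a channel change, then average `count` readings.
 */
int32_t adc_average(adc_read_fn read, void *ctx, size_t discard, size_t count)
{
    assert(read != NULL && count > 0);

    for (size_t i = 0; i < discard; i++) {
        (void)read(ctx);            /* flush the previous channel's data */
    }

    int64_t sum = 0;                /* wide accumulator avoids overflow */
    for (size_t i = 0; i < count; i++) {
        sum += read(ctx);
    }
    return (int32_t)(sum / (int64_t)count);
}

/* Simple replay source for trying the model on a host PC. */
typedef struct {
    const int32_t *samples;
    size_t next;
} replay_t;

int32_t replay_read(void *ctx)
{
    replay_t *r = (replay_t *)ctx;
    return r->samples[r->next++];
}
```

Feeding the model two stale readings followed by four valid ones, with `discard` set to 2 and `count` set to 4, returns the mean of the four valid readings only.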

Serial EEPROM Control

The VHDL hardware manages the commands, addresses, and 8-bit data to and from the serial EEPROM and presents a memory-mapped, 32-bit-word-wide interface to the MicroBlaze processor. The EEPROM stores calibration coefficients for the sensors in the pneumatic module and PID coefficients, as well as other configuration and identification information.

Valve PWM Drive and Monitor

After each PID calculation, the MicroBlaze embedded processor outputs another valve drive setpoint to the memory-mapped addresses for the valve drive. The valve drive VHDL state machine uses these values to generate a pulse-width modulated (PWM) drive to the proportional valves at 65 kHz. This frequency is high enough to be above the audible range and to be filtered out well by the inductive time constant of the valve, resulting in a smooth current profile.
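The relationship between the PWM frequency and the available duty-cycle resolution can be sketched with a little integer arithmetic. The article does not state the FPGA clock frequency, so the 50-MHz value used in the note below is purely an assumption, and the function names `pwm_period_ticks` and `pwm_compare` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Counter ticks per PWM period for a given fabric clock (truncating). */
uint32_t pwm_period_ticks(uint32_t clk_hz, uint32_t pwm_hz)
{
    assert(pwm_hz > 0 && clk_hz >= pwm_hz);
    return clk_hz / pwm_hz;
}

/* Compare value for a duty cycle expressed in 0.1% steps (0..1000). */
uint32_t pwm_compare(uint32_t period_ticks, uint32_t duty_permille)
{
    assert(duty_permille <= 1000);
    return (uint32_t)(((uint64_t)period_ticks * duty_permille) / 1000u);
}
```

With an assumed 50-MHz clock, a 65-kHz PWM period spans 769 counter ticks, so the drive could be set in roughly 769 discrete steps. A higher clock or a lower PWM frequency would trade audible noise and ripple filtering against finer resolution.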

Signals from the valve drive are read back into the FPGA for diagnostic purposes. This allows the MicroBlaze processor to examine the voltages and determine if a valve is installed and if the PWM drive is able to drive both high and low.

UART/FIFO

We implemented a UART and FIFO in VHDL hardware to handle the commands and responses with the base unit CPU. Because of the half-duplex protocol, the pneumatic module may be in transmit mode only while it is responding to a command.

VHDL hardware automatically returns to receive mode a fixed time after the command response and presents to the MicroBlaze processor a transmit and receive FIFO along with the necessary flags (FIFO empty, FIFO full).

Basic Pneumatic Module Operation

As I mentioned earlier, the physical setpoints supplied by the customer may be static or programmed. They are expressed in physical units such as psig (gauge pressure) or ml/minute (flow rate) and can be set with a resolution of 0.001 psig and 0.01 ml/min. These setpoints are sent at a rate of 50 Hz to each pneumatic control module.

The basic job of the pneumatic controller module is to regulate the pressure or flow to the desired setpoint. This is done with a modified proportional integral derivative (PID) controller.

Command Interpreter

Because there is no clock shared between the base unit CPU and the LVDS pneumatic modules, commands to the pneumatic control module are not synchronized with the closed-loop operation of the embedded controller in the module.

Figure 3 – Overall pneumatic controller block diagram (blocks shown: FPGA with embedded controller containing the MicroBlaze processor CPU and memory, OPB, IP interface (IPIF), IPIC, and user logic for the UART and FIFO, valve drive PWM, EEPROM control, and multiplexed A/D control; external multiplexed A/D, EEPROM, valve drivers, valve driver monitor, power supplies, pressure and flow sensors, proportional valves, and the 4-pin connector with the 2-wire LVDS link)

For example, you can send commands while the MicroBlaze processor is in the middle of PID calculations. Because the control loop takes precedence, the command parsing and execution will be temporarily delayed until the MicroBlaze embedded processor is idle. This delay is short enough that the command response still falls within the acceptable response-time limits set by the base unit CPU.
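One way to picture this priority rule is a polling step in which control work always wins and a pending command runs only when no control work is due. The sketch below is an illustration under that assumption; the article does not describe the actual firmware structure, and `task_t`, `module_t`, and `module_step` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { TASK_IDLE, TASK_CONTROL, TASK_COMMAND } task_t;

typedef struct {
    bool     command_pending;  /* set when a complete command has arrived */
    uint32_t control_runs;     /* number of control-loop passes executed */
    uint32_t commands_run;     /* number of commands parsed and executed */
} module_t;

/*
 * One scheduling step: the control loop takes precedence, so a pending
 * command is deferred until a step in which no control work is due.
 */
task_t module_step(module_t *m, bool control_due)
{
    assert(m != NULL);

    if (control_due) {
        m->control_runs++;
        return TASK_CONTROL;
    }
    if (m->command_pending) {
        m->command_pending = false;
        m->commands_run++;
        return TASK_COMMAND;
    }
    return TASK_IDLE;
}
```

In this model, the worst-case extra latency for a command is the duration of one control pass, which is why the response can still meet the base unit's time limit.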

The simple command interface defines commands that set setpoints, read actuals, read and write the EEPROM, specify the gas type (H2, Helium, N2, ArCH4), and run a number of diagnostic tests on the system, A/D converter, and valve drives.

Filtered Sensor Readings

The sensors are read hundreds of times a second. Two IIR filters in the MicroBlaze processor process the data, making use of the barrel shifter hardware configured as part of the processor configuration file.

The first low-pass filter helps reduce raw sensor noise above the bandwidth of the control loop that would result in a wider control band. The second low-pass filter is used on data returned to the base unit CPU. It limits the bandwidth of the data to below 25 Hz so that there is no aliasing of the data in the 50-Hz rate of data to the base unit CPU.
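A common way to build such a filter so that it benefits from a barrel shifter is a single-pole IIR low-pass whose coefficient is a power of two, so that each update needs only adds and multi-bit shifts. The sketch below shows that general technique; it is not the article's filter, whose order and coefficients are not given, and the names `iir_lp_t`, `iir_lp_init`, and `iir_lp_update` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Single-pole low-pass: y += (x - y) / 2^shift, done with shifts only. */
typedef struct {
    int32_t  acc;    /* filter state, stored as y * 2^shift for precision */
    unsigned shift;  /* larger shift means a lower cutoff frequency */
} iir_lp_t;

void iir_lp_init(iir_lp_t *f, unsigned shift, int32_t initial)
{
    assert(f != NULL && shift < 16);
    f->shift = shift;
    f->acc = initial * (int32_t)(1 << shift);
}

/*
 * Feed one sample and return the filtered value.
 * Assumes non-negative samples, or a compiler that implements >> on
 * negative values as an arithmetic shift (the usual behavior).
 */
int32_t iir_lp_update(iir_lp_t *f, int32_t x)
{
    assert(f != NULL);
    f->acc += x - (f->acc >> f->shift);
    return f->acc >> f->shift;
}
```

Keeping the state scaled by 2^shift avoids the dead zone that appears when the difference `x - y` is smaller than 2^shift and would otherwise be shifted to zero, so the output settles exactly on a constant input.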

Control Loop

The PID controller function implements a standard PID control with a couple of necessary twists. Figure 4 shows the overall operation of the PID controller.

The first modification of a standard PID is due to the +24V power supply. A fully loaded instrument could have 18 proportional valves to power. Rather than implement a regulated 20W +24V supply, we decided to use a simple and reliable full-wave-rectified and unregulated supply. A supply like this will track the AC voltage and ripple at twice the line frequency.

The magnitude of the +24V supply can range from 22V to 28V, depending on the AC voltage to the instrument. This results in a variation in the open-loop gain of the system, which can affect stability.

Additionally, the AC ripple on the +24V supply modulates the drive to the valve and can result in pressure or flow variations that are at too high a frequency for the control loop to attenuate away (the disturbances are above the bandwidth of the controller).

We solved these problems by reading the value of the +24V supply with the multiplexed A/D converter and having the MicroBlaze processor calculate a feed-forward compensation term. If the +24V supply increases, the valve drive is automatically reduced to compensate. This requires a divide operation. Because this happens at over 100 Hz and takes away from PID calculation time, the hardware divider was implemented in the MicroBlaze processor system specification.
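A minimal sketch of such a compensation term scales the requested drive by the ratio of nominal to measured supply voltage. The exact formula used in the instrument is not given in the article, so the ratio form, the millivolt units, and the name `feedforward_drive` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SUPPLY_NOMINAL_MV 24000u  /* nominal +24V supply, in millivolts */

/*
 * Scale the PID output so the effective valve drive stays constant as the
 * unregulated supply moves: a higher supply yields a smaller drive value.
 * With a 16-bit drive the product fits in 32 bits, so the calculation needs
 * one 32-bit multiply and one 32-bit divide.
 */
uint32_t feedforward_drive(uint16_t drive, uint32_t supply_mv)
{
    assert(supply_mv > 0);
    return ((uint32_t)drive * SUPPLY_NOMINAL_MV) / supply_mv;
}
```

Over the 22V to 28V range quoted above, a requested drive of 1000 would be adjusted to 1090 at the low end and 857 at the high end, which also removes the corresponding variation in open-loop gain.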

The second modification from a standard PID controller is due to the pneumatic system itself. The range of setpoints you can define covers a broad dynamic range, with pressure setpoints from 0.2 psi up to 150 psi and flows from 3 ml/min up to 1,250 ml/min.

Not surprisingly, the dynamics of the pneumatic system over this range of operation change a lot. It is possible, for example, to have more than a 30-dB gain change in the transfer function of the pneumatic system (ratio of sensor out to valve drive in) over these ranges. In addition, the bandwidth (poles) of the pneumatic system varies greatly over this operating range.

To obtain good control and step response performance, you can alter the values of the PID coefficients based on the setpoints specified. This alters both the gain and frequency response of the PID controller to match the characteristics of the pneumatic system at that setpoint.
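Selecting coefficients by setpoint is often implemented as a small lookup table, a technique usually called gain scheduling. The sketch below assumes that approach; the article does not say how the coefficients are chosen, and the band limits, gain values, units, and names (`pid_gains_t`, `gain_band_t`, `example_bands`, `select_gains`) are all illustrative placeholders rather than values from the instrument.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int32_t kp, ki, kd;        /* fixed-point gains, arbitrary units */
} pid_gains_t;

typedef struct {
    uint32_t    max_setpoint;  /* upper limit of this band, in milli-psi */
    pid_gains_t gains;
} gain_band_t;

/* Illustrative bands only; real values would come from calibration. */
static const gain_band_t example_bands[] = {
    {   5000u, { 400, 40, 8 } },
    {  50000u, { 200, 20, 4 } },
    { 150000u, { 100, 10, 2 } },
};

/* Return the gains of the first band that contains the setpoint. */
const pid_gains_t *select_gains(const gain_band_t *table, size_t n,
                                uint32_t setpoint)
{
    assert(table != NULL && n > 0);

    for (size_t i = 0; i < n; i++) {
        if (setpoint <= table[i].max_setpoint) {
            return &table[i].gains;
        }
    }
    return &table[n - 1].gains;  /* clamp setpoints above the last band */
}
```

A table like this could be stored alongside the other PID coefficients in the EEPROM, with one table per sensor type.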

Conclusion

New techniques and improved precision in chemical analysis have enabled our customers to meet increasingly stringent requirements for chemical identification.

Part of that improvement comes from the increased accuracy and precision of gas supplies within the GC. The new Agilent 7890A network gas chromatograph continues this tradition, while the Xilinx Spartan-3 XC3S200 FPGA with MicroBlaze embedded processor helped the product meet its functionality, precision, cost, and modularity requirements.

Figure 4 – Modified PID controller (blocks shown: setpoint and sensor feedback combined into the error, standard PID controller, gain and bandwidth correction, +24V feed-forward calculation, +24V compensation multiplier, proportional valve, pneumatic system, and pressure or flow sensor)

Reconfiguring the Battlespace

You can develop aerospace and defense applications with Xilinx and Wind River technologies.

by Paul Parkinson
Senior Systems Architect
Wind [email protected]

The development of modern defense systems presents a familiar technological challenge in that the requirements for increased application functionality conflict with the requirements for minimal footprint in terms of space, weight, and power (often referred to as “SWaP”). This is a particular concern for airborne platforms, especially unmanned air vehicles (UAVs) that have physical constraints.

Some defense systems are physically vulnerable to interception by hostile forces because of their operational role, including intelligence, surveillance, target acquisition, and reconnaissance (ISTAR) assets such as reconnaissance aircraft, UAVs, and land-based sensor systems. If these platforms are to use leading-edge technology, it must be technology that cannot be compromised or reverse-engineered.

In addition, field-based configuration and upgradability are essential to achieve interoperability between new and rapidly deployed coalition forces.

In this article, I’ll explain how Xilinx® Virtex™-II Pro and Virtex-4 platform FPGAs, along with Wind River software platforms, are ideal for the development of ISTAR systems, and present design techniques that will enable secure deployment, with the use of partial reconfiguration in the field to help enable interoperability between coalition forces.

Technology Challenges

In previous decades, defense systems used military-grade components from semiconductor manufacturers. These components had long life cycles, which were essential to support the in-service use of deployed hardware for 20, 30, or even 40 years.

In 1994, U.S. Secretary of Defense William Perry first advocated the use of commercial off-the-shelf (COTS) components on U.S. military programs where appropriate; other governments around the world subsequently adopted this philosophy (to varying degrees). The migration to COTS has led to a move away from specific military-grade components and toward a greater reliance on industrial-grade components. The latter’s shorter supported life cycles could impact in-service lifetimes.

In addition, defense companies have developed custom ASICs at great expense for specialized functions in defense systems, particularly ISTAR systems. Replacing these devices in deployed systems can be prohibitively expensive. There is also an ever-growing requirement to extend the in-service life of ASICs as the operational lifetimes of front-line defense systems are extended.

These challenges can now be addressed through the use of Virtex-II Pro and Virtex-4 FX FPGAs, given their increased gate counts and incorporation of CPU and DSP functionality.

The platform FPGA also has the potential to be used for algorithmic operations, but until recent years this has not been exploited because it is difficult to express software algorithmic operations in million-gate applications in VHDL. However, the ability to program reconfigurable logic from high-level software languages such as C as well as VHDL opens up the potential of these devices to further exploitation.

Platform Development

You can readily exploit the potential of the platform FPGA through the Xilinx Platform Studio (XPS), which generates hardware peripheral and IP definitions in VHDL, closely coupled with the automatic generation of a VxWorks board support package (BSP) in C source code for the PowerPC 405 processor core(s) contained within the Virtex-II Pro and Virtex-4 FX device fabric.

(The details of this approach were previously discussed by Rick Moleres and Milan Saini in their article, “Generating Efficient Board Support Packages” [Xilinx Embedded magazine, March 2006] and included integration with Wind River’s Tornado 2.2 IDE and VxWorks 5.5 RTOS.)

Since the publication of that article, it has become possible to undertake similar developments using the state-of-the-art Wind River Workbench development suite and VxWorks 6 RTOS. VxWorks 6 provides a number of enhanced capabilities when compared to VxWorks 5.5, including technologies that are critical in defense applications: advanced memory protection for application isolation using real-time processes, and a dual-mode IPv4 and IPv6 network stack, which the U.S. Department of Defense mandated for new programs in 2003.

After XPS generates the VxWorks 6 BSP, it can be configured in the Workbench project configurator (Figure 1). The configurator enables components to be configured in a hierarchical manner and parameter values to be specified. When this has been completed, you can build the VxWorks 6 kernel for your target device from Workbench through automated processes (Figure 2). Once VxWorks is running on the PowerPC 405 core(s), you can use Workbench for source-level application development (Figure 3).

The integration between XPS and Workbench allows you to use a VxWorks 6-based common software platform on Virtex FPGAs for application or algorithm control or gateway network interfaces. This provides a very flexible approach that you can exploit for functions such as image compression and data link encryption to relevant NATO standardization agreements.

System and application software in these devices (known as device software) is often implemented in an architecture-specific manner, using low-level programming languages such as assembly language and custom software schedulers. Although this has enabled exploitation of the hardware’s performance potential, it has also made the task of technology insertion and program upgrades more difficult.

You can overcome this problem by using software platforms based on open standards. For example, the Wind River General Purpose Platform – VxWorks Edition incorporates the Wind River Workbench development suite, which uses the Eclipse open-source framework to provide seamless integration between tools for different parts of the device software development life cycle, as well as a consistent GUI for developers. This approach has tangible benefits in terms of developer efficiency, transferable skills, and knowledge retention.

Figure 1 – Workbench kernel configuration for Xilinx ML410 VxWorks 6 BSP

Figure 2 – Workbench build of Xilinx board VxWorks 6 kernel image

Figure 3 – Workbench target connection to PowerPC 405 core in Virtex FPGA

On the runtime side, VxWorks 6.4 introduced 100% conformance to the POSIX PSE52 real-time controller profile, enabling software reuse from legacy systems and the development of new portable applications.

Communications Security in Defense Systems Design

Communications security (ComSec) is an important requirement in many defense systems, especially in UAVs, which often need to maintain continuous communications links for remote piloting and real-time image streaming. These communication links must be secure from eavesdropping and interference by hostile forces, so encryption is required for flight control and sensor data transmissions. You could implement this encryption in software (which places an additional load on the processor) or in dedicated hardware logic. You could also use reconfigurable technology, which provides benefits in terms of performance and secure design (which I’ll discuss shortly).

Let’s consider a hypothetical scenario involving the use of Triple DES encryption on a communications link. Triple DES encryption provides a relatively high level of security by encrypting data three times using three 64-bit private encryption keys. This is inherently more secure than single DES encryption but will take a processor three times longer to compute, because the processor performs the encryption as three sequential steps.

Given the parallelism inherent in a Xilinx platform FPGA, you can implement three encryption stages operating in parallel, with the output of one stage pipelined into the next stage in 64-bit words. This acceleration enables the passing of data at higher rates over secure downlinks.

For example, a 350-MHz PowerPC can Triple DES-encrypt a data stream at the rate of 1.2 Mbps, whereas a lower-power 20-MHz Xilinx Virtex-II Pro FPGA can Triple DES-encrypt a data stream at the rate of 22 Mbps, with each 64-bit word processed in 57 clock cycles.
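The FPGA figure follows directly from the clock rate, the block size, and the cycle count quoted above. The small helper below reproduces that arithmetic; the function name `block_cipher_bps` is hypothetical, and the calculation assumes one 64-bit word completes every 57 cycles with no other overhead.

```c
#include <assert.h>
#include <stdint.h>

/* Sustained throughput in bits per second when one block of `block_bits`
 * is completed every `cycles_per_block` clock cycles. */
uint64_t block_cipher_bps(uint64_t clk_hz, uint32_t block_bits,
                          uint32_t cycles_per_block)
{
    assert(cycles_per_block > 0);
    return clk_hz * block_bits / cycles_per_block;
}
```

For a 20-MHz clock, 64-bit words, and 57 cycles per word, this gives about 22.5 million bits per second, consistent with the 22 Mbps quoted, and roughly 18 times the 1.2 Mbps software figure.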

You could apply this encryption acceleration technique to provide a secure TCP/IP-based communications framework (using IPSec in conjunction with Triple DES or potentially 256-bit AES encryption) and data link protocols. This would involve performing the network packet processing on the PowerPC processor, with computationally intensive encryption offloaded to the FPGA, acting as a coprocessor.

Information Security in Defense Systems Design

Defense systems now contain an increasing number of subsystems. In an airborne platform, for example, there are avionic systems (for flight control), mission systems, and sensor systems for payloads such as electro-optic/infrared sensors (EO/IR) and synthetic aperture radar (SAR) (Figure 3). ComSec in this case refers to the use of firewalls and encryption for the transmission and reception of information securely between networks without transforming the information during transport.

Data communication between the avionics systems and mission systems will include the passing of global positioning information, bearing, and altitude. However, information with differing security classifications may also need to be transformed securely between applications, subsystems, or networks – this is known as InfoSec.

Preventing Reverse Engineering by Secure Design

The implementation of a system destruct sequence capability is possible for systems that are vulnerable to compromise or capture by foreign forces. However, previous implementations of destruct sequences may not provide a rapid, complete, and irreversible destruction of subsystems; although they may achieve software destruction, they may not sufficiently destroy all of the hardware architecture, especially fixed hardware components such as ASICs.

Thus, it could still be possible to perform reverse engineering on some aspects of the hardware design. To prevent this, reconfigurable technology offers the ability to achieve rapid and complete destruct sequences and prevent reverse engineering.

FPGA devices can be erased completely, leaving no trace of their original application. By asserting a specific signal, the device could be cleared in hardware in a few hundred microseconds. You could also perform the software destruct sequence through programmatic software control issued from a remote system, a sensible scenario in the case of a UAV capture.

You can implement the sequence from a VxWorks-based common software platform connected to the command and control center through a secure IP-based network, permanently and irreversibly erasing the content of a Xilinx FPGA using the PLD API for VxWorks Embedded Systems (PAVE), as shown in Figure 4.

I referred to Triple DES encryption in the context of secure communication links earlier, but you can also employ this method within the content of the Xilinx FPGA (also known as the payload, not to be confused with a UAV payload). When the bitstream is read out from the FPGA, the Triple DES-encrypted bitstream must be decrypted before it can be decoded. This not only protects the FPGA content from being copied blindly and reused; it also prevents it from being reverse-engineered should the UAV be intercepted.

Figure 3 – Military airborne data networks (blocks shown: avionics network, mission systems, and sensor systems, separated by firewalls)

Conclusion

ISTAR systems will spearhead the deployment of new coalitions in response to potential or emerging conflicts, providing vital imagery intelligence and situational awareness from reconnaissance missions. A common software platform and reconfigurable technology could be used to implement codecs that support reconfiguration in the field, assisting in the rapid deployment and interoperability with NATO and/or other coalition forces by sharing encryption keys. Platforms can also be reconfigured on the fly to communicate with legacy or incompatible systems belonging to other coalition members (if a suitable codec exists).

Reconfigurable technology not only provides the ability to rapidly deploy new technology, but also the means to extend the in-service lifetimes of deployed systems. A network-enabled software platform acts as a secure gateway to deliver new content and perform partial or full reconfiguration of FPGAs while deployed in the field. This approach would reuse the architecture shown in Figure 4, but employ an upgrade application instead of a system destruct application.

When considered together, these capabilities provide a technical advantage over traditional fixed-chip solutions (Figure 5) and present a powerful argument for their adoption on defense programs.

For more information, I recommend a paper by Saar Drimer of the University of Cambridge on the use of FPGAs in the design of secure defense systems. To read “Volatile FPGA Design Security – A Survey,” visit www.cl.cam.ac.uk/~sd410/papers/fpga_security.pdf.

For more information about the Wind River General Purpose Platform – VxWorks Edition, visit www.windriver.com.

Figure 4 – CPU-controlled reconfiguration of a Xilinx FPGA (blocks shown: CPU running VxWorks with the secure TCP/IP ground link, software system destruct application, and PAVE; FPGA hardware holding the encryption algorithms and application algorithms)

Figure 5 – Reconfigurable technology life cycle (technical advantage plotted against time for FPGA-based and fixed-chip solutions; labels: deployment advantage, where being first to deploy increases technical advantage, and field upgrade advantage, which increases the in-life support yield while in the field)


Hardware/Software Optimization of Embedded Systems

Xilinx FPGAs provide an excellent opportunity to improve system performance during the embedded system design process.

by Peter Thorwartl
CEO/Sr. Systems Architect
So-Logic Electronic [email protected]

Franz Summerauer
Hardware Engineer
AVL List [email protected]

Alfred Pöelzl
Software Engineer
AVL List [email protected]

At the beginning of a new embedded system design, there are so many different solution options that it is hard to predict their impact on system cost, development time, and performance.

Some examples of possible choices are:

• Memory devices (DDR, SRAM, flash, SPI-Flash)

• Bus architecture and topologies (PLB, OPB, OCM)

• Number of bus masters and arbitration schemes

• Processor architecture (PowerPC 405, PowerPC 440, Xilinx® MicroBlaze™ processor, ARM)

• Operating system (stand-alone, Xilinx microkernel, Embedded Linux, commercial RTOS)

• Single or multiple processors

• Interprocessor communication (shared memory, shared bus, interrupts)

• Interfaces to the outside world (Ethernet, PCI Express)

In this article, we’ll show how you can evaluate these different choices during the whole design process to build an optimized system.

SYSTEM DEVELOPMENT

System Architecture

AVL offers modular test bed components, highly reliable instrumentation, and advanced tools supporting the development and calibration process for all kinds and sizes of combustion engines (Figure 1). One part of this system is the embedded Gigabit Ethernet card discussed in this article, which collects data from as many as four data acquisition boards through parallel burst interfaces, performs basic data pre-processing, and streams the pre-processed data to a host system through a Gigabit Ethernet interface. It should be possible to cascade several such acquisition cards; therefore, we planned to use two Ethernet interfaces, as shown in Figure 2.

The main components of this card are a Virtex™-4 FX60 device, three DDR RAMs, two Gigabit Ethernet interfaces, and one XC9500XL family CPLD. Two PowerPC processors determine the functionality of the interface board.

Because of high speed requirements, the multi-port memory controller executes checksum generation and the data management of four burst data interfaces. The measurement data is placed into an external high-speed DDR RAM. Both PowerPC processors are responsible for the data flow management and Ethernet communication. Five high-speed interfaces control the external data acquisition boards.

Exploring with a Demo Board

A good way to explore the capabilities of a new FPGA family is to use a demo board like the ML403. First, we implemented the simplest possible embedded system running on one PowerPC (see Figure 3). The CPU executes the software from internal block RAMs. With this small system, we could evaluate the throughput of the Gigabit Ethernet interface. We could also determine the memory bandwidth and speed of the PowerPC.

We estimated the required logic resources, such as the number of flip-flops, block RAMs, PowerPCs, and DCMs, from the data sheet to select the device, and the number of I/O blocks, I/O standards, and I/O banks to select the package.

After this first experience, we decided to use the Virtex-4 FX family and FF676 package, which allowed us to use different devices (FX20 with one PowerPC, FX40 and FX60 with two PowerPCs). We needed a different TCP/IP stack to exploit the full performance of the Gigabit Ethernet interface, so we selected the Treck stack.

A single PowerPC is too slow to move the data from the burst interfaces to the memory and then to the Ethernet interfaces, so we had to connect the burst data and Ethernet interfaces directly to the memory. We decided to connect one network interface to each PowerPC.

Figure 1 – Typical environment for the data acquisition system

Figure 2 – Board of data acquisition system

Figure 3 – Block diagram of simple system with ML403 demo board (blocks shown: PPC1 with PLB, DCR, ISOCM, DSOCM, and FCM ports; PLB_DDR to DDR1; PLB_TEMAC and HARDTEMAC to GE_PHY; DCR_INTC; ISOCM_BRAM_IF and DSOCM_BRAM_IF to block RAM; PLB_OPB bridge; OPB_GPIO for I2C and LED; OPB_SPI for flash; OPB_BRAM_IF for block RAM; OPB_UART for RS232; legend: external, PCORE, hard macro, custom)

We are sure that a platform without a real-time operating system will make the development of this application easier. This has to be decided on a case-by-case basis.

Hardware Design

The system uses three DDR RAMs: one for each PowerPC and one for the measurement data storage. We had two design possibilities: a special peripheral that is capable of bus-mastering DMA transfer, or a multiport memory controller (MPMC2) to enable simultaneous access to the data acquisition RAM. We selected the second option.

We were then ready to start developing our new platform. We mainly used Mentor Graphics tools: DxDesigner for schematic capture, Boardstation for PCB layout, Hyperlynx for signal simulation, and IODesigner for the interface between hardware and ISE™ software. The PCB design was awarded first prize in the Transportation & Automotive category in the Mentor Graphics 2007 PCB Technology Leadership Awards.

We also developed the peripheral cores for the custom peripherals. We needed two custom cores: one for transmitting data to the transmission link (SO_TLINK_FSL) and one for capturing the block data (SO_NPI_FSL_BURST).

SO_TLINK_FSL

We already had VHDL code for the transmission link with a register interface, so we decided to use a block RAM interface to memory-map these registers to the PowerPC. One advantage of this solution is that for every bus (OPB, PLB, OCM, and even LMB for the MicroBlaze processor), a block RAM interface controller exists, so we could easily compare the performance of different bus configurations. If connected to the OPB bus, we could share this resource between the two PowerPCs. With this universal approach, we had the freedom to move the control software modules from one PowerPC to the other.

Unfortunately, this solution was too slow for two reasons. First, we saw that bus arbitration takes too many clock cycles; even worse, because we used the OPB bus, we needed an additional PLB2OPB bridge. For the OCM bus, time for bus arbitration would be lower, but we would need a second DSOCM block RAM interface, which lowers the frequency of the whole bus.

Figure 4 – Block diagram collecting performance data with PLB

Figure 5 – Block diagram of optimized system with MPMC2

The other problem was that we could not send further data until we had acknowledgment that the packet was received. One solution would have been to integrate a FIFO in our core, but we could also use the FSL links instead and eliminate both disadvantages. We simply estimated the best FIFO size, because we could optimize it later under different load conditions. The FSL interface automatically uses the SRL16 for small FIFOs and the FIFO16B hard block for larger depths.
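The effect of a FIFO between the sender and the transmission link can be illustrated with a small behavioral model. The sketch below is not the SO_TLINK_FSL core: the function name, the burst length, and the acknowledgment latency are assumptions chosen for illustration. It counts the cycles in which a producer holds a word it cannot hand over, first with a single register (depth 1) and then with deeper FIFOs.

```python
from collections import deque


def stalled_cycles(burst_len, ack_latency, fifo_depth):
    """Count cycles in which the producer holds a word it cannot hand over.

    A word stays in the FIFO until the link reports the acknowledgment,
    ack_latency cycles after transmission starts. fifo_depth=1 models the
    direct register interface without buffering.
    """
    if ack_latency < 1 or fifo_depth < 1:
        raise ValueError("ack_latency and fifo_depth must be at least 1")

    fifo = deque()
    sending = False   # True while the link waits for an acknowledgment
    timer = 0         # cycles left until the acknowledgment arrives
    words_left = burst_len
    stalls = 0

    while words_left > 0:
        # Link side: finish the word in flight, then start the next one.
        if sending:
            timer -= 1
            if timer == 0:
                fifo.popleft()
                sending = False
        if not sending and fifo:
            sending = True
            timer = ack_latency

        # Producer side: hand over one word per cycle if there is room.
        if len(fifo) < fifo_depth:
            fifo.append(burst_len - words_left)
            words_left -= 1
        else:
            stalls += 1

    return stalls


if __name__ == "__main__":
    for depth in (1, 2, 4):
        print(depth, stalled_cycles(burst_len=4, ack_latency=3, fifo_depth=depth))
```

In this model, a FIFO at least as deep as the burst removes the stalls entirely, which is why the depth is worth tuning under realistic load.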

SO_NPI_FSL_BURST

We wanted the SO_NPI_FSL_BURST core to move the data directly to the MPMC2, so we needed to develop an interface between the test card interface and the NPI (native port interface, similar to the local link interface) of the MPMC2. The PowerPC must control this interface to specify the address location and burst length. We decided to use the FSL interface instead of the GPIO because the FSL is really simple to handle and saves a few clock cycles.

For the simulation of the peripheral cores, we used ISIM to verify the fundamental functionality of our cores. For a system simulation with the PowerPCs, we used ModelSim PE with the SWIFT interface for the processor models.

In the implementation process, we generally used the makefile generated by the Xilinx Platform Studio software and added a lot of features, such as checkout and tagging for the source code repository, tarball and .zip file generation, generation of different programming file formats, debugging with the ChipScope™ analyzer core inserter, and compilation of different application software for both PowerPCs.

Our scripts ran the development tools on Windows XP and Linux (CentOS 5.1) and supported different boards and devices.

Real Life

We first migrated our design from the ML403 to our new target board to check all external peripherals (DDR, Gigabit Ethernet, SPI, I2C) (see Figure 4). We can reuse this design later for hardware tests in manufacturing.

We then used small internal block RAMs on the data and instruction side OCM buses, with boot loaders for each PowerPC and fast interrupt routines. The boot loader moves the data from the SPI flash to the dedicated DDR RAM. No parallel flash is needed. We use the same SPI flash for booting the FPGA. The CPLD converts the signals from SPI to the usual serial interface for booting.

We implemented the I2C with the OPB_GPIO core and a software driver rather than the dedicated OPB_IIC core, because the GPIO core was free of charge and we only needed to read an ID tag once after boot.

For the interprocessor communication between the two PowerPCs, we used a shared memory implemented in a single-ported block RAM connected to the OPB bus with the OPB_BRAM controller.

We implemented two separate interrupt controllers, one for each PowerPC, connected to the DCR bus.

The central core in our design is the multiport memory controller with eight ports: four for data acquisition (SO_NPI_FSL_BURST), two PLB ports (one for each PowerPC), and two CDMA ports to connect the Gigabit Ethernet interfaces (Figure 5).

An interesting design phase was finding performance bottlenecks and fine-tuning the system performance. We connected our board to real test equipment to get real-life data.

One very useful tool for qualifying the results is the combined XMD and ChipScope analyzer for hardware/software co-debugging. You can trigger with the ChipScope analyzer and stop the software debugger, and vice versa.

We also experimented with interrupt priorities, FIFO sizes on the MPMC2, and arbitration schemes inside the MPMC2. We also moved some software modules from PowerPC1 to PowerPC2 and vice versa to optimize system performance.

Conclusion

You can start with a very simple design and upgrade step by step to a higher performance design by boosting Ethernet throughput, boosting DDR throughput with a multiport memory controller, adding data filtering, FFTs, and so on.

In the future, it will be necessary more often than not to postpone important system architecture decisions to later stages of the design process. With an FPGA embedded system, you have many opportunities to remove bottlenecks from your system.

For more information, visit www.so-logic.net or www.avl.com.

Further Improvements

• Upgrading the current design environment to the newest versions of ISE software and EDK 9.2.

• Moving all OPB cores to the PLB bus, because all OPB cores are now also available for PLB. In this case, we would have to decide which shared peripheral is connected to which PowerPC. As a positive consequence, we could remove the whole OPB bus and two relatively large PLB2OPB bridges, each saving 500 slices.

• Running embedded Linux on one or both PowerPCs if you need to run one of the many applications that are available for Linux.

• Adding cores for double-precision floating-point to offload the host PC application or integrate more DSP functionality into the FPGA.

• Downgrading to a lower performance system with only one PowerPC or, as an even more low-cost solution, replacing the PowerPC with the MicroBlaze processor v7.0, which also has a PLB bus interface.

• Upgrading to a higher performance Virtex-5 FXT FPGA with PowerPC 440 and removing the CPLD, because more user I/O pins are available on the FPGA in the same package and booting from SPI flash is supported natively.

• Upgrading the FIFO links to serial point-to-point, high-speed links with RocketIO™ transceivers to improve data transfer speed.

Custom Processors in FPGAs

You can build a custom processor with the newest Xilinx FPGAs.

by Ilya Tarasov
Associate [email protected]

Processor cores, systems, and designs are gaining ground in the FPGA solution portfolio. Xilinx® MicroBlaze™ and PicoBlaze™ processors, along with the EDK development tool, have paved the way to processor-based systems on a single chip.

You are not, however, limited to these solutions only; you can always implement a soft processor with your desired features. In this article, which is devoted to soft processors in general, I’ll attempt to determine when it is preferable to implement custom soft-core processors instead of proven Xilinx solutions.

How Custom Processor Design Works

Before discussing why you might need to design a custom processor, let’s take a look at the design methodology. I’ll use a very simple example of a processor to show the main steps and what problems you might encounter.

The heart of the processor is a finite state machine (FSM). An FSM is quite effective because the latest FPGAs are synchronous by nature; therefore, an FSM is a good base from which to describe processor behavior. Being an FSM, the processor must change its state at the rising edge of the clock signal.

Let’s take a straightforward example where we have three 8-bit registers, PC (program counter) plus RA and RB (general-purpose registers), and an 18-bit command bus named “cmd.” This scheme is very efficient for an FPGA with block RAM in a 1,024 x 18-bit organization. Figure 1 shows the behavioral code for this example.

This example illustrates the implementation of a fully synchronous digital design. If you can provide an appropriate sequence of commands for each new program counter value, you’ll get a programmable state machine; in other words, a processor.

Of course, you may want to use on-chip FPGA resources for a complete system-on-chip solution without any external memory components. But looking at block RAM waveforms, you will find that it takes at least one clock cycle to read a command from memory. To account for this, let’s implement a two-stage FSM where the first stage is “read” and the second stage is “execute,” shown in Figure 2.
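Before committing an instruction encoding to HDL, it can help to mirror the behavior of Figures 1 and 2 in software. The following is a sketch of such a model and not part of the original design: the function names, the sample programs, and the choice to treat undefined codes as no operation are assumptions made for illustration. The opcodes follow the ranges given in Figure 1 (0 = no operation, 0x100 to 0x1FF = load immediate into RA, 0x200 to 0x2FF = jump if RA is zero, 1 = RA + RB, 100 = combined increment and decrement).

```python
MASK8 = 0xFF  # PC, RA, and RB are 8-bit registers


def execute(cmd, pc, ra, rb):
    """Apply one command and return the new (pc, ra, rb).

    All right-hand sides use the register values from before the clock
    edge, as the signal assignments in the VHDL process do.
    """
    next_pc = (pc + 1) & MASK8
    if 0x100 <= cmd <= 0x1FF:          # load immediate value into RA
        return next_pc, cmd & MASK8, rb
    if 0x200 <= cmd <= 0x2FF:          # jump if RA = 0
        return (cmd & MASK8 if ra == 0 else next_pc), ra, rb
    if cmd == 1:                       # RA <= RA + RB
        return next_pc, (ra + rb) & MASK8, rb
    if cmd == 100:                     # combined: RA + 1 and RB - 1 in one step
        return next_pc, (ra + 1) & MASK8, (rb - 1) & MASK8
    return next_pc, ra, rb             # 0 and (by assumption) undefined codes


def run(program, cycles, rb=0):
    """Alternate fetch and execute states for a number of clock cycles."""
    pc, ra = 0, 0
    state = "fetch"
    cmd = 0
    for _ in range(cycles):
        if state == "fetch":           # one cycle spent on the memory read
            cmd = program[pc] if pc < len(program) else 0
            state = "execute"
        else:
            pc, ra, rb = execute(cmd, pc, ra, rb)
            state = "fetch"
    return pc, ra, rb


if __name__ == "__main__":
    # load 5, add RB, combined command, no operation: 4 commands, 8 cycles
    print(run([0x105, 1, 100, 0], cycles=8, rb=3))
```

Each command costs two clock cycles in this model, which matches the fetch/execute alternation of the two-stage state machine.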

    -- register declaration omitted to save article space
    process(clk)
    begin
      if rising_edge(clk) then
        case conv_integer(cmd) is
          when 0 =>
            pc <= pc + 1;  -- no operation, but we must remember to increment the program counter
          when 16#100# to 16#1FF# =>
            RA <= cmd(7 downto 0); pc <= pc + 1;  -- load immediate value to RA
          when 16#200# to 16#2FF# =>
            if conv_integer(RA) = 0 then pc <= cmd(7 downto 0);
            else pc <= pc + 1;
            end if;  -- jump if RA = 0
          -- we can add any 'simple' commands, which will affect any desired registers
          when 1 =>
            RA <= RA + RB; pc <= pc + 1;
          -- etc., with format: when <value> => RA <= RA <op> RB; pc <= pc + 1;

          -- also, let's look at parallel data processing and combined commands
          when 100 =>
            RA <= RA + 1; RB <= RB - 1; pc <= pc + 1;  -- different operations with two registers

          -- we may also add other complex commands, which have no direct analogs in common instruction sets

          when others =>
            pc <= pc + 1;  -- assumption: undefined codes act as no operation
        end case;
      end if;
    end process;

Figure 1 – Simple decoder and ALU for custom processor

    -- in architecture section
    signal st : bit;  -- state of core: '0' - fetch, '1' - execute
    -- after 'begin' keyword
    process(clk)
    begin
      if rising_edge(clk) then
        case st is
          when '0' =>
            st <= '1';  -- simple wait for BRAM, go to EXECUTE state
          when '1' =>
            st <= '0';  -- return to FETCH state
            case conv_integer(cmd) is
              when 0 => pc <= pc + 1;
              -- ... all of the 'case' code from the previous example
            end case;
        end case;  -- st
      end if;
    end process;

Figure 2 – Simple processor engine for use with block RAM

Why (and When) to Opt for a Custom Processor

Designers familiar with the MicroBlaze processor and EDK software may ask why we need yet another core, or how it could replace proven, high-performance products. Indeed, you don’t need another processor in general if you can perform a common task using well-known approaches.

But because the MicroBlaze processor is a general-purpose 32-bit RISC core, a custom core makes it possible to achieve a significantly different solution. Let’s analyze when this effort might lead to valuable results.

DSP Algorithms with High Parallelism

In this case, FPGA performance is determined by the structure of hardware multiplier blocks, which realize certain algorithms of data processing. There may be many such data paths with embedded multipliers and XtremeDSP™ solution slices, and no processor core can achieve comparable benchmarks given the highly parallel nature of the DSP block array.

You may want to control this array with standard interfaces and protocols, collect statistics, and perform additional tasks in each DSP block. This is a job for processor cores. But you can’t just place the same number of MicroBlaze processor cores as embedded multipliers on the FPGA. For these purposes, simple processor cores with small register sets, short commands, and small command memories (even based on distributed RAM) may prove very useful.

Non-Symmetrical Register Architecture

A MicroBlaze embedded processor realizes a symmetrical three-address register architecture. This means that any register may perform the same operations (with few exceptions for MicroBlaze processors), while general commands may choose operands and destinations separately.

A three-address architecture with many registers is good for compilers, because many variables can be loaded into the processor without needing to be stored back into memory. If you have fewer registers, some variables may be forced to unload when the number of active variables becomes greater than the number of registers. (This is not a problem for less complex algorithms, when calculations are simple enough to impose some limitations on processor architecture.)

For example, you can use a dedicated accumulator to act as a destination register for all ALU activity. This allows command width to shrink because you can exclude information about destination registers (it is always known). Looking further, it is possible to realize a zero-operand processor (with a stack architecture like Forth processors or Java machines) with the smallest possible command width. Code compactness is pretty useful in the FPGA world because of the limited amount of on-chip block RAM.
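The saving in command width can be estimated with simple arithmetic. The sketch below counts only the bits spent on register addresses and ignores opcode and immediate fields; the function name and the register counts used in the example are assumptions for illustration.

```python
def register_field_bits(num_registers, fields):
    """Bits a command spends on register addresses.

    fields is the number of register addresses encoded in each command:
    3 for a three-address format, 2 when the destination is an implied
    accumulator, 0 for a zero-operand (stack) format.
    """
    bits_per_field = (num_registers - 1).bit_length()  # ceil(log2(n)) for n >= 1
    return fields * bits_per_field


if __name__ == "__main__":
    for name, fields in (("three-address", 3), ("implied destination", 2), ("stack", 0)):
        print(name, register_field_bits(32, fields), "bits")
```

With 32 registers, a three-address command spends 15 bits on register addresses, an implied destination reduces this to 10 bits, and a stack machine needs none, so more commands fit into the same block RAM.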

I/O-Oriented Project

Here’s another uncommon approach to I/O module integration. In generic processor cores, it is necessary to assign some addresses to each I/O port and then write low-level drivers that will exchange data with these I/O cores through in and out commands.

But if you implement performance-critical I/O cores, you may want to boost and simplify data exchange with them. You can turn to dedicated buses for such modules, with dedicated data exchange commands. This takes additional resources and adds special commands, of course, so the result will not be universal (indeed, you can’t add hundreds of I/O modules this way as you can for an on-chip peripheral bus). However, it may be a good “ad hoc” solution.

Advantages of FPGA-Based Processors

The unlimited reprogrammability of FPGAs allows you to realize very uncommon instructions. This is important, because it is virtually impossible to add any non-standard, uncommon instruction to ASIC processors without being sure that it will be used. If you don’t meet the market requirements for ASICs, you’ll lose time and money, but an experimental instruction set costs almost nothing with FPGAs.

If an experimental architecture with additional commands turns out not to be required, you can simply reconfigure the FPGA and download the MicroBlaze processor as a proven, supported solution. Experimental architectures may be very effective for particular purposes, and we can always try them at the beginning of a new project. Uncommon algorithms, parallel calculations, or operations with ultra-wide numbers are quite appropriate reasons to use custom processor architectures.

Xilinx FPGAs have a rich set of features that make processor implementations effective. For example:

• The balanced set of resources and register-rich architecture allow for implementation of deeply pipelined, register-rich processor cores.

• On-chip, dual-port block RAM stores program and data on the same chip to make FPGAs a system-on-chip. Two independent ports also extend memory bandwidth and provide easy downloading and debugging.

• Distributed RAM in SliceM allows you to create small memory sections, such as register files, buffers, stacks, and FIFOs. All of these are important parts of a processor system and can significantly improve performance and functionality.

• The newest Xilinx Spartan™-3AN FPGAs with internal flash memory allow the implementation of a single-chip system with as much as 11 Mb of flash memory and 72 kb of dual-port SRAM. This is enough to complete a serious embedded system-on-chip.

• High-performance Virtex™-5 FPGAs provide better results for processor soft cores, thanks to their six-input look-up tables (LUTs). In real projects (with an ALU based on combinatorial logic), this makes it possible to implement the processor with fewer logic levels than with previous Virtex platform FPGA generations (and other FPGAs with four-input LUTs). In general, when compared to Virtex-4 devices, processor core designs become 30% faster.

Conclusion

Now that designers have access to real hardware prototyping devices, it is high time to take a fresh look at the embedded processor world. With powerful and easy-to-use development tools like ISE™ software, you can easily try your hand at processor design. This is a promising field for research teams, qualified designers, those who need a specific computing platform, and IT enthusiasts.

CTC Inline Group in Russia, a training center, is now holding training courses on this topic; additionally, you will find a non-commercial community of those creating experimental processor implementations with FPGAs at www.fforum.winglion.ru (with text in both Russian and English).

Figure 3 – A DSP coprocessor with general-purpose core

Designing Multiprocessor SOCs

Using Xilinx EDK tools and IP, you can scale your applications by designing SOCs with multiple processors.

by Vasanth Asokan
Staff Software Engineer
Xilinx, [email protected]

Given the rapid growth of embedded processing requirements, system architects are turning toward multiprocessor designs to solve the twin problems of burgeoning complexity and the inadequacy of single-processor systems.

With their high logic density and high-performance hard blocks, recent generations of FPGAs have made powerful chip multiprocessing (CMP) solutions a reality. The challenge now lies in how you can rapidly explore and create designs in this solution space.

Xilinx® Embedded Development Kit (EDK) tools and IP provide the flexibility to create uniquely crafted, customized multiprocessing solutions on FPGA logic real estate that can meet both price and performance targets. In this article, I’ll give a broad overview of multiprocessing concepts as they apply to Xilinx solutions based on PowerPC and MicroBlaze™ embedded processors.

Scenarios

Performance and functional partitioning are compelling reasons to design systems with multiple processors. In general, there are some commonly encountered scenarios where multiprocessing can help:

• Multiple independent functions. The design may have multiple, independent sets of processing tasks. An attractive way to address this challenge is to create independent processing modules dedicated to each processing task, assigning each processing module a unique processor and peripheral set.

• Control or data-plane offload. One common scenario is the presence of a distinct set of real-time (compute- or data-intensive) and non-real-time tasks that might cause solutions based on a single processor to be non-responsive. In these cases, you can dedicate a slave processor to perform the real-time task in a timely manner. This leaves the master processor to perform other regular tasks and serve as an interface to the host system. The master processor also monitors and controls the slave processor. The slave processor may contain specialized functions or interfaces, allowing it to meet computation performance requirements. Some examples of this scenario include network offload, media processing, and security algorithms.

• Interface processing. On systems that act as a bridge for or switch between multiple interfaces, you can dedicate a slave processor to the processing of data at each interface, while one or more master processors perform higher level bridging and switching tasks.

• Stream processing. For handling stream-oriented computation, you can arrange processors to act on the data stream in a pipeline fashion. Each peer stage in the multiprocessor pipeline acts on one portion of the computation before passing it to the next processor. This is an effective way to increase system throughput.

• Reliability and redundancy. You can replicate processing systems multiple times to provide reliability and redundancy.

• Symmetric processing. Traditional symmetric processing (SMP) is a useful solution with which you can scale up (by adding more processors) the performance of applications that do not possess clean partition boundaries. An SMP-capable OS layer manages parallel tasks and automatically schedules them across multiple processors. The SMP use model cannot be applied to Xilinx processors, however, because they lack cache coherency, a requirement for implementing SMP.

Apart from the SMP scenario, all other scenarios are feasible on Xilinx FPGAs with EDK tools. The unique capability of Xilinx processing solutions is the flexibility to customize each of the processing subsystems to the application requirements. For example, not all processors may need a cache or a floating-point unit. By assigning specific functions to specific processors, you can create a tailored solution that meets all design goals.

A Simple and Scalable System Architecture

As you can see, a number of use models are possible for multiple processors. There are also a large number of possibilities for the system architecture. Reconciling a use case to a clean, scalable topology and architecture can be daunting, so it helps to define a baseline architecture that fits most needs.

Figure 1 – Dual processor architecture

Figure 1 illustrates a dual-core architecture. This architecture presents a simple and scalable multiprocessor system definition. You can generate derivative topologies, starting from this definition, to meet design constraints or challenges. The key concepts of this architecture are as follows:

• The architecture is a simple extension of two completely independent single processing systems, formed by linking the systems together with communication components.

• The shared components are all multi- (or dual-) ported in nature. The multiported nature of these components allows each processor’s system bus to be independent of the other, both in terms of static as well as dynamic load. By isolating each processing subsystem, you ensure that the system bus is not locked out for a processor or peripheral because of a transaction executed for another processor. All multiported peripherals arbitrate accesses on various ports internally.

• The key shared peripheral is the multiported memory controller (MPMC). MPMC provides access to external memory through different port interfaces. Multiple processors can connect to MPMC through independent ports. This topology allows both the PowerPC and MicroBlaze processors to simultaneously access external memory with minimal latency and high bandwidth. MPMC currently provides a maximum of eight ports, thus allowing as many as three to four processors to connect to a single external memory.

• The architecture also shows sharing of internal block RAM memory between processors. Sharing on-chip block RAM can be an extremely fast way to pass kilobyte-sized data between the processors. Accesses to block RAMs can also be deterministic, an important requirement in some applications.

• Apart from shared memory, there are two other cores, the XPS Mailbox and XPS Mutex, that provide simple forms of interprocessor communication. The XPS Mailbox core provides a low-latency, FIFO-style message-passing interface between the two processors, synchronously or asynchronously. It can be used either for directly passing messages or for passing pointers to messages stored in shared memory. You can use the XPS Mutex core for arbitrating accesses to shared resources (whether they are on-chip or off-chip) between software on the two processors. Together, these cores allow you to build cooperating software programs on each processor.

• Some systems might wish to share peripherals that are not multiported (like a UART or SPI or I2C). Such a situation requires providing a system bus bridge from the bus that does not connect to the peripheral to the bus that does connect to the peripheral. Figure 1 shows the use of a bus bridge to share a UART between the two processors.

• Figure 1 intentionally shows a PowerPC 405 processor as the first and a MicroBlaze processor as the second to illustrate certain specific characteristics of each processor. However, any one processor can be equivalently replaced by the other with very minimal adaptation. Thus, this architecture can be seamlessly transitioned between processors.
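The interplay of shared memory, mailbox, and mutex described in the list above can be sketched as a behavioral model. This is not the driver API of the XPS Mailbox or XPS Mutex cores; the class names, method names, FIFO depth, and buffer layout are all hypothetical and serve only to show the pattern of passing a pointer through the mailbox while a lock protects the shared buffer.

```python
from collections import deque


class Mutex:
    """One lock of a mutex array; it records which processor owns it."""

    def __init__(self):
        self.owner = None

    def try_lock(self, cpu_id):
        if self.owner is None:
            self.owner = cpu_id
        return self.owner == cpu_id

    def unlock(self, cpu_id):
        if self.owner != cpu_id:
            return False          # only the owner may release the lock
        self.owner = None
        return True


class Mailbox:
    """Two FIFOs, one per direction, between processor 0 and processor 1."""

    def __init__(self, depth):
        self.depth = depth
        self.fifos = {0: deque(), 1: deque()}   # keyed by receiving processor

    def send(self, sender, word):
        fifo = self.fifos[1 - sender]
        if len(fifo) >= self.depth:
            return False          # FIFO full; the caller retries later
        fifo.append(word)
        return True

    def receive(self, receiver):
        fifo = self.fifos[receiver]
        return fifo.popleft() if fifo else None


def demo():
    shared_ram = bytearray(64)    # stands in for the shared block RAM
    mutex, mailbox = Mutex(), Mailbox(depth=4)

    # Processor 0 writes a message into shared memory under the lock,
    # then passes only its offset and length through the mailbox.
    payload = b"sample"
    assert mutex.try_lock(cpu_id=0)
    shared_ram[16:16 + len(payload)] = payload
    mutex.unlock(cpu_id=0)
    mailbox.send(sender=0, word=(16, len(payload)))

    # Processor 1 receives the pointer and reads the message.
    offset, length = mailbox.receive(receiver=1)
    assert mutex.try_lock(cpu_id=1)
    data = bytes(shared_ram[offset:offset + length])
    mutex.unlock(cpu_id=1)
    return data
```

Passing only an offset and a length keeps the mailbox traffic small, while the bulk data stays in the shared memory.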

Although Figure 1 illustrates the recommended overall multiprocessing architecture, various constraints may require you to further refine the architecture. For instance, in a system in which logic area and resource usage are a key concern, all of the processors could be connected to the same system bus. Although this makes the system less deterministic and increases run-time load on the bus, it offers area savings by eliminating a new system bus as well as removing the need for multiple ports on IPs.

Other derivative architectures are also possible, such as having the high-performance processor on a separate system bus and multiple low-performance processors on a shared system bus. You can also create hierarchical topologies by connecting processing subsystems to each other through multiple levels of bridges. The various tools and IP provided in EDK allow you to further refine this base topology down to something that exactly fits your needs.

Other Considerations

You will typically apply a few other considerations to your multiprocessing architecture. For instance, you will need to define memory maps in a non-conflicting manner between the two processors. The automatic address generation tools in EDK simplify this task down to a push of a button.
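What a non-conflicting memory map means can be shown with a small check. The sketch below is independent of the EDK address generator; the function name, region names, base addresses, and sizes are hypothetical. It reports every pair of regions in a shared address space whose ranges overlap.

```python
def find_conflicts(memory_map):
    """Return pairs of region names whose address ranges overlap.

    memory_map is a list of (name, base, size) tuples.
    """
    regions = sorted(memory_map, key=lambda region: region[1])
    conflicts = []
    for i, (name_a, base_a, size_a) in enumerate(regions):
        for name_b, base_b, _ in regions[i + 1:]:
            if base_b < base_a + size_a:   # b starts before a ends
                conflicts.append((name_a, name_b))
    return conflicts


# Hypothetical layout of one external memory shared by two processors.
GOOD_MAP = [
    ("cpu0_program", 0x0000_0000, 0x0100_0000),
    ("cpu1_program", 0x0100_0000, 0x0100_0000),
    ("shared_buffers", 0x0200_0000, 0x0010_0000),
]
BAD_MAP = [
    ("cpu0_program", 0x0000_0000, 0x0100_0000),
    ("cpu1_program", 0x00F0_0000, 0x0100_0000),
    ("shared_buffers", 0x0200_0000, 0x0010_0000),
]

if __name__ == "__main__":
    print(find_conflicts(GOOD_MAP))
    print(find_conflicts(BAD_MAP))
```

A check of this kind is useful when linker scripts for the two processors are maintained by hand, because an overlap between private program regions would otherwise only show up as corrupted code or data at run time.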

You will also need to give some thought to your clocking and reset networks. You will have an option to clock all of the processors at the same rate or use different clocking domains for each processor. Similarly, you can define reset domains at various levels such as processor-only reset, processor subsystem reset, and system reset. The processors must also be independently connected to a debugger port, thus allowing separate debugging sessions for each processor.

Beyond the hardware considerations, you will also need to design your software systems such that they can work cooperatively. This includes using shared memory, message passing, and common synchronization concepts such as barriers and rendezvous so that the system behaves in a predictable and synchronized fashion. Commercial software stacks are also available that can provide higher level communication paradigms.
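A barrier, one of the synchronization concepts mentioned above, can be illustrated on a host machine. In the sketch below, Python threads stand in for the two processors and a list stands in for shared memory; these substitutions and all names are assumptions for illustration, and a real system would build the barrier on shared memory together with a mutex or mailbox.

```python
import threading


def run_two_workers():
    """Two workers exchange results only after both have reached the barrier."""
    barrier = threading.Barrier(2)
    partial = [None, None]     # stands in for shared memory
    seen = [None, None]

    def worker(cpu_id):
        partial[cpu_id] = (cpu_id + 1) * 10    # local computation
        barrier.wait()                         # rendezvous with the peer
        seen[cpu_id] = partial[1 - cpu_id]     # safe: the peer has written

    threads = [threading.Thread(target=worker, args=(i,)) for i in (0, 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Without the barrier, a worker could read its peer's slot before the peer has written it; with the barrier, the result is the same on every run.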

Conclusion

For more details about the possibilities of multiprocessor systems, a longer white paper version of this article is available at www.xilinx.com/support/documentation/white_papers/wp262.pdf. Also consider the reference designs described and provided in Xilinx application note XAPP996, “Dual Processor Reference Design Suite,” at www.xilinx.com/support/documentation/application_notes/xapp996.pdf.

Stay tuned for new features in future EDK tools, such as support for automated multiprocessor design creation and cooperative debugging. The Xilinx Virtex™-5 FXT platform, with the powerful PowerPC 440 processor, also opens up endless possibilities for creating ultra-high performance multiprocessor systems.


by Tom HillSr. Marketing Manager, DSPXilinx, [email protected]

Along with next-generation video compression standards, the industry shift from basic video processing to more complex and integrated processing solutions is driving system requirements for video performance beyond what stand-alone DSPs can deliver. FPGAs such as the Xilinx® Spartan™-3A DSP fill this gap for cost-sensitive military, automotive, medical, consumer, industrial, and security applications by providing more than 20 GMACs of DSP performance for less than $30. FPGAs uniquely provide logic, embedded processing, OS support, and drivers to offer a complete end-to-end solution for video.

What keeps developers from using FPGAs in video applications is not a lack of understanding of their performance benefits; rather, it’s a lack of experience with the design flow. This is especially true for traditional DSP program developers accustomed to programming in C.

You can achieve FPGA performance gains by exploiting the flexibility of the device to configure a hardware architecture optimized for a particular application. This flexibility adds a degree of freedom to the development process that also contributes to its complexity.

The XtremeDSP™ Video Starter Kit (VSK) provides a complete and easy-to-use design environment. Example applications and full support for standard Xilinx tool flows help accelerate the design process yet still allow for end-product differentiation.

Introducing the XtremeDSP VSK – Spartan-3A Edition

The XtremeDSP Video Starter Kit – Spartan-3A Edition is a video development platform comprising the Spartan-3A DSP 3400A development platform, the FMC-video daughtercard, and a VGA camera.

The Spartan-3A DSP 3400A development platform, which you can purchase separately, is built around the Spartan-3A DSP XC3SD3400A device. This device provides 126 embedded DSP blocks for implementing coprocessing and high-performance video processing systems.
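As a back-of-the-envelope check on these figures, the following Python sketch estimates peak multiply-accumulate throughput from the DSP block count. It assumes each DSP block completes one MAC per clock cycle; that assumption, the 250 MHz example clock, and the function names are illustrative and not taken from Xilinx documentation.

```python
DSP_BLOCKS = 126  # embedded DSP blocks in the XC3SD3400A (figure from the article)


def peak_gmacs(dsp_blocks, clock_mhz):
    """Peak GMAC/s, assuming one MAC per DSP block per clock cycle."""
    return dsp_blocks * clock_mhz * 1e6 / 1e9


def clock_mhz_for(target_gmacs, dsp_blocks):
    """Clock rate in MHz needed to reach a target GMAC/s figure."""
    return target_gmacs * 1e9 / dsp_blocks / 1e6


if __name__ == "__main__":
    # Clock needed for the "more than 20 GMACs" claim
    print(f"{clock_mhz_for(20, DSP_BLOCKS):.1f} MHz")
    # Peak throughput at an assumed (hypothetical) 250 MHz clock
    print(f"{peak_gmacs(DSP_BLOCKS, 250.0):.1f} GMAC/s")
```

Under these assumptions, the 126 blocks would need to run at roughly 159 MHz to reach 20 GMACs.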

The FMC-video daughtercard extends the video I/O capabilities of the Spartan-3A DSP 3400A development platform by including the following additional interfaces:

• DVI-I input, both digital and analog

• Composite input and output

• S-video input and output

• Two camera inputs

Video Development Tools

You can create video applications for the VSK without RTL knowledge or experience using the Xilinx Embedded Development Kit (EDK) and System Generator for DSP. EDK is a comprehensive solution for designing embedded programmable systems and includes the Platform Studio tool suite, embedded IP cores, and the MicroBlaze™ embedded processor.

System Generator for DSP enables the use of The MathWorks Simulink/MATLAB modeling environment for FPGA design by providing a Simulink blockset of more than 100 Xilinx-optimized DSP building blocks.

Accelerating Video Development on FPGAs Using the Xilinx XtremeDSP Video Starter Kit

Try a video development methodology that is processor-friendly, generates highly optimized results, and does not require VHDL or Verilog knowledge.

Developing Video Applications Using the Base Platform

An embedded system called the base platform provides the framework from which you can develop video applications using the VSK. The base platform is created using the Xilinx Platform Studio Base System Builder (BSB) and includes a MicroBlaze embedded processor.

This framework provides a starting point for new designs or serves as an easy migration path for existing applications developed on processor-based systems. You can recompile any C code created for external processors on the MicroBlaze processor with minimal effort; once ported, the high-performance video chains can migrate from software to the FPGA fabric.

To assist in this migration, the VSK includes an IP library of custom peripherals that you can easily add to the base system using Platform Studio. You can connect to the video interfaces, manage data frames, and perform memory access and basic video processing. These custom peripherals include:

• DVI in

• DVI out

• Camera

• Video frame buffer controller (VFBC)

• Video processing pipeline

The Xilinx VFBC is ideal for video applications where the hardware control of two-dimensional data is required to achieve real-time operation. This is typical of motion estimation, video scaling, on-screen displays, and video capture used in video surveillance, video conferencing, and video broadcasts.

Jump-Start Development Using VSK Reference Designs

The VSK provides three reference designs for jump-starting the development of video applications running on Xilinx FPGAs. Each reference design is built on the base platform and uses custom peripherals from the VSK IP library. Table 1 lists each reference design and the video processing and connectivity capabilities that it illustrates. These reference designs are intended to serve as a starting point from which further development may occur. Figure 2 shows how the DVI pass-through reference design interfaces into the base system.

Figure 1 – Base platform block diagram
[Blocks shown: MicroBlaze processor; block RAM; MDM; ILMB, DLMB, IXCL, DXCL, and PLB buses; PLB arbiter; MPMC; DDR2; UART; GPIO (LEDs, push buttons, DIP switches); XPS IIC (two); System ACE technology; processor system reset; clock generator.]

Table 1 – VSK reference design summary

Reference design: DVI Pass-Through
• Capturing a video stream from the input port
• Performing real-time image processing on the video stream
• Displaying the processed video

Reference design: DVI Frame Buffer
• Capturing a video stream from a DVI source
• Buffering the video stream in external memory
• Displaying the buffered video
• Reporting memory bandwidth utilization data

Reference design: Camera Frame Buffer
• Capturing a video stream from a camera
• Performing processing on the video stream
• Buffering the video stream in external memory
• Displaying the processed video at a different rate
• Using a microprocessor to configure various aspects of the video pipeline
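To get a feel for the memory traffic that the frame buffer designs in Table 1 generate, the sketch below estimates external-memory bandwidth for a buffered video stream. The resolution (640 x 480, matching a VGA camera), the pixel size, and the frame rates are assumptions chosen for illustration; the reference designs themselves may use other formats.

```python
def frame_bytes(width, height, bytes_per_pixel):
    """Size of one video frame in bytes."""
    return width * height * bytes_per_pixel


def buffer_bandwidth_mb_s(width, height, bytes_per_pixel, write_fps, read_fps):
    """Memory traffic in MB/s: each frame is written once per captured
    frame and read once per displayed frame."""
    per_frame = frame_bytes(width, height, bytes_per_pixel)
    return per_frame * (write_fps + read_fps) / 1e6


if __name__ == "__main__":
    # Hypothetical case: capture at 30 frames/s, display at 60 frames/s,
    # 3 bytes per pixel
    print(buffer_bandwidth_mb_s(640, 480, 3, write_fps=30, read_fps=60))
```

Because the camera frame buffer design displays video at a different rate than it captures, the write and read rates are kept as separate parameters.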


Using Model-Based Design to Create Video Applications

Accelerating video applications on FPGAs requires that performance-critical operations migrate from software running on processors to hardware. The VSK supports a variety of hardware design flows. This includes flows that leverage strong hardware design backgrounds using VHDL/Verilog and flows that accommodate little or no hardware design experience by leveraging more abstract modeling environments including C, MATLAB, and Simulink.

Simulink from The MathWorks is a model-based design environment that you can use to develop algorithmic models of video systems. The MathWorks provides an optional video and imaging blockset for Simulink that provides a rich set of video building blocks for easily processing streaming video and visualizing the results at each step in the model.

You can initially model the video processing algorithm itself abstractly using floating-point data types and high-level video and imaging blocks, refining the algorithm as you consider the trade-offs associated with complexity, system cost, and performance.
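One refinement step of this kind is moving filter coefficients from floating point to fixed point. The sketch below, written in plain Python rather than Simulink, compares a floating-point FIR filter with a fixed-point version. The coefficient values, word lengths, and function names are hypothetical and serve only to illustrate the trade-off between precision and hardware cost.

```python
def quantize(value, frac_bits):
    """Round a real coefficient to an integer with frac_bits fractional bits."""
    return round(value * (1 << frac_bits))


def fir_float(samples, coeffs):
    """Floating-point reference FIR; outputs start once the window is full."""
    n = len(coeffs)
    return [sum(coeffs[k] * samples[i - k] for k in range(n))
            for i in range(n - 1, len(samples))]


def fir_fixed(samples, coeffs, frac_bits):
    """FIR with integer coefficients; outputs are rescaled for comparison."""
    q = [quantize(c, frac_bits) for c in coeffs]
    n = len(q)
    scale = 1 << frac_bits
    return [sum(q[k] * samples[i - k] for k in range(n)) / scale
            for i in range(n - 1, len(samples))]


def max_abs_error(samples, coeffs, frac_bits):
    """Largest difference between the floating- and fixed-point outputs."""
    ref = fir_float(samples, coeffs)
    fix = fir_fixed(samples, coeffs, frac_bits)
    return max(abs(a - b) for a, b in zip(ref, fix))


if __name__ == "__main__":
    pixels = [0, 16, 64, 128, 255, 255, 128, 64, 16, 0]  # one line of 8-bit video
    smoothing = [0.2, 0.6, 0.2]                          # hypothetical kernel
    for bits in (4, 8, 12):
        print(bits, max_abs_error(pixels, smoothing, bits))
```

Fewer fractional bits mean narrower multipliers but a larger deviation from the floating-point model, which is the kind of trade-off the refinement step weighs.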

System Generator for DSP enables the use of Simulink for Xilinx FPGA designs by providing a rich set of DSP building blocks, optimized for Xilinx devices. Tight integration allows DSP designs captured in System Generator to be converted into custom peripherals for Platform Studio and connected to the base system using the processor local bus or Fast Simplex Link bus.

Figure 3 shows an example of a camera video processing pipeline created using System Generator, which is included in the camera frame buffer reference design shipped with the VSK.

System Generator supports hardware-in-the-loop co-simulation using the Spartan-3A DSP 3400A development platform. You can use this platform to accelerate the performance of Simulink simulations up to 1,000x. This acceleration enables video algorithm development and debugging using real-time video streams read into Simulink through The MathWorks data acquisition toolbox.

Conclusion

The XtremeDSP Video Starter Kit – Spartan-3A DSP Edition keeps development costs low by providing a complete video development solution for under $1,600. The DSP and embedded design tools included enable rapid FPGA development of the video system without requiring RTL design experience.

DSP-optimized FPGA platforms such as the Virtex-5 SXT device, Virtex-4 SX device, and Spartan-3A DSP are particularly well suited to meet both the cost and performance requirements of high-performance video and image processing applications in security, broadcast, industrial, consumer, medical, and automotive markets – while insulating products against early obsolescence.

For more information, visit www.xilinx.com/s3adsp_vsk.

Figure 2 – Base system with video pipeline
[Blocks shown: the base platform blocks (MicroBlaze processor, block RAM, MDM, ILMB, DLMB, XCL, PLB, PLB arbiter, MPMC, UART, GPIO for LEDs, push buttons, and DIP switches, XPS IIC, System ACE technology, processor system reset, clock generator) plus video blocks labeled Video In, DVI_IN, DE_GEN, Gamma_in, 2D FIR Filter, Gamma_out, DVI_OUT, and Video Out, marked “Developed Using System Generator.”]

Figure 3 – System Generator diagram for a camera video processing pipeline

By Tom R. Halfhill, Senior Analyst, In-Stat/Microprocessor [email protected]

Xilinx is upgrading its MicroBlaze embedded-processor core again, this time adding an optional memory-management unit (MMU) that allows the 32-bit processor to run sophisticated operating systems supporting virtual memory. Developers can also substitute a simpler memory-protection unit (MPU) or omit supervised memory management altogether.

The first full-fledged operating system announced for the new MicroBlaze v7 is LynuxWorks BlueCat Linux. Until now, MicroBlaze processors were limited to simpler embedded operating systems that don’t support virtual memory or memory protection. With its optional MMU or MPU, MicroBlaze v7 is suitable for a wider range of embedded applications requiring greater security and reliability.

MicroBlaze v7 has other improvements as well. New instructions provide faster floating-point performance and better I/O with coprocessors and custom logic. In addition, Xilinx has upgraded the CoreConnect interface to the latest CoreConnect Processor Local Bus (PLB) v4.6 specification, which provides faster links to on-chip peripherals.

All these improvements strengthen MicroBlaze’s position against competing processor cores intended for synthesis in FPGAs – primarily Altera’s Nios II and ARM’s new Cortex-M1. Those rival processors don’t have MMUs or MPUs. MicroBlaze v7 is available now as part of a $495 development kit. (An extra $100 buys a development board with a Xilinx Spartan-3E FPGA.) That price includes a MicroBlaze v7 license, and developers owe no royalties when deploying the processor in Xilinx chips.

Robust Memory Management

Xilinx has steadily improved MicroBlaze since introducing the soft processor in 2001. Two years ago, Xilinx began offering an optional FPU. (See MPR 5/17/05-02, “MicroBlaze Can Float.”) Last year, Xilinx lengthened the instruction pipeline to permit higher clock speeds. (See MPR 11/13/06-01, “Xilinx Revs Up MicroBlaze.”) And earlier this year, Xilinx released MicroBlaze v6, which added a few other minor enhancements. Now, with MicroBlaze v7, Xilinx is introducing big-league memory management, which significantly expands the range of embedded applications for which MicroBlaze is suited.

With an MMU, MicroBlaze can run full-fledged operating systems based on the Linux 2.6 kernel. Windows CE is another possibility, although Xilinx hasn’t announced a deal with Microsoft yet. Processors that support virtual memory can run powerful operating systems, larger application programs, and multiple programs while avoiding memory collisions that would compromise security and reliability. In addition, virtual memory allows a system to work with less physical memory, which can reduce costs and power consumption. As small embedded systems increasingly tackle heavy-duty applications once limited to bigger systems, sophisticated memory management is becoming almost a necessity.

Of course, many embedded systems don’t need this level of memory management, so the MicroBlaze MMU is optional. It’s just another configuration option available when MicroBlaze developers synthesize the processor using the Xilinx development tools. Another option is to implement an MPU, which provides memory protection without virtual memory and address translation. An MPU is appropriate for embedded systems that must shield a program’s memory region against accidental or malicious intrusions by other programs. Yet another option is to implement privileged-mode execution without memory protection or virtual memory. In privileged mode, only the operating system or a privileged application program can execute certain instructions critical to system security.

PLATFORMS, PROCESSORS & TOOLS

MicroBlaze v7 Gets an MMU
Memory Manager Brings Full-Fledged Linux to Xilinx Processor Core

First published in Microprocessor Report, November 13, 2007. Copyright 2007 by In-Stat, a unit of Reed Business Information. Reprinted with permission.

Table 1 shows how each option (MMU, MPU, or privileged execution) affects the size of the synthesized processor, as measured by the number of additional lookup tables (LUT) required to implement the feature in the FPGA’s programmable-logic fabric. Implementing privileged mode alone requires so few LUTs that it’s practically a no-brainer. The other options demand more forethought. In particular, the MMU – which requires about 1,000 LUTs – will account for approximately one-third of a fully configured MicroBlaze v7 core. (To put that size in perspective, the Spartan-3E 1600E chip on a MicroBlaze v7 development board has about 33,000 LUTs and costs $10 to $15, depending on volume.)
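The proportions quoted here can be checked against the LUT counts in Table 1. The sketch below is simple arithmetic on the article’s own numbers; the dictionary layout and function name are illustrative.

```python
# LUT counts from Table 1: (Virtex-5, Spartan-3)
OPTION_LUTS = {
    "MMU": (910, 1100),
    "MPU": (560, 670),
    "Privileged mode only": (34, 38),
}
FULL_CORE_LUTS = 3000            # "about 3,000 LUTs" for a fully configured core
SPARTAN_3E_1600E_LUTS = 33000    # "about 33,000 LUTs"


def share_pct(part, whole):
    """Percentage of `whole` taken up by `part`."""
    return 100.0 * part / whole


if __name__ == "__main__":
    for name, (virtex5, spartan3) in OPTION_LUTS.items():
        print(f"{name}: {share_pct(virtex5, FULL_CORE_LUTS):.1f}% (Virtex-5), "
              f"{share_pct(spartan3, FULL_CORE_LUTS):.1f}% (Spartan-3) of a full core")
    print(f"Full core: {share_pct(FULL_CORE_LUTS, SPARTAN_3E_1600E_LUTS):.1f}% "
          f"of a Spartan-3E 1600E")
```

The MMU comes to roughly 30% to 37% of a fully configured core, consistent with the “approximately one-third” estimate, while the whole core takes under a tenth of the Spartan-3E 1600E.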

The full-blown MMU is the largest memory-management option, partly because it needs a translation lookaside buffer (TLB) to cache a subset of the table that translates virtual and physical memory addresses. MicroBlaze v7 has a 64-entry unified TLB. To supplement this software-managed buffer, there are shadow entries for instruction-memory pages and data-memory pages. The number of shadows is user configurable: one, two, four, or eight entries for instructions and the same for data. (The default configuration is two shadows for instructions and four shadows for data.) The processor automatically manages the shadows, which prevent the TLB from thrashing. Memory pages can range in size from 1KB to 16MB, and mixed sizes are allowed. With 32-bit effective addressing, MicroBlaze v7 can address up to 4GB of flat memory.
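To illustrate how a TLB with mixed page sizes maps addresses, here is a minimal Python model. It is a conceptual sketch only: the entry fields, the linear search, and the example addresses are assumptions made for illustration and do not mirror the actual MicroBlaze register layout or lookup hardware.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TlbEntry:
    virt_base: int   # virtual base address, aligned to page_size
    phys_base: int   # physical base address, aligned to page_size
    page_size: int   # bytes; a power of two from 1 KB to 16 MB


def translate(tlb: List[TlbEntry], vaddr: int) -> Optional[int]:
    """Return the physical address for vaddr, or None on a TLB miss."""
    for entry in tlb:
        offset_mask = entry.page_size - 1
        if (vaddr & ~offset_mask) == entry.virt_base:
            return entry.phys_base | (vaddr & offset_mask)
    return None  # miss: with a software-managed TLB, a handler refills an entry


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of memory mapped at once if every entry uses the same page size."""
    return entries * page_size


if __name__ == "__main__":
    tlb = [
        TlbEntry(0x0000_1000, 0x0008_0000, 1024),              # 1 KB page
        TlbEntry(0x0100_0000, 0x4000_0000, 16 * 1024 * 1024),  # 16 MB page
    ]
    print(hex(translate(tlb, 0x0000_1234)))
    print(translate(tlb, 0x0000_2000))          # not mapped
    print(tlb_reach(64, 16 * 1024 * 1024))      # all 64 entries as 16 MB pages
```

With every one of the 64 entries holding a 16MB page, the TLB would cover 1GB at a time; with 1KB pages it would cover only 64KB, which is why mixed page sizes are useful.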

The MicroBlaze MMU is patterned after the one in an IBM Power 405 processor. That’s no coincidence. Some Virtex family FPGAs integrate a hardened Power 405 core, which is much faster than a MicroBlaze processor synthesized in the fabric. Having a similar MMU brings a few advantages to MicroBlaze v7. First, programmers will have an easier time porting virtual-memory operating systems from the Power Architecture to MicroBlaze. Second, developers should find it easier to create a multicore design that mates one or more MicroBlaze cores with a Power 405 in a shared-memory configuration. And third, the Power-style MMU prepares MicroBlaze v7 for possible future arrangements with IBM to integrate newer Power cores into Xilinx FPGAs.

| Memory Manager Type | LUTs (Virtex-5) | LUTs (Spartan-3) |
|---|---|---|
| Memory Management Unit (MMU) | 910 | 1,100 |
| Memory Protection Unit (MPU) | 560 | 670 |
| Privileged Mode Only | 34 | 38 |

Table 1 – Sizes of three MicroBlaze v7 memory-management options. Each option occupies additional lookup tables (LUT) in the FPGA, enlarging the processor core. (A fully configured core requires about 3,000 LUTs.) Note that the sizes for these options are slightly different in Xilinx Virtex-5 and Spartan-3 FPGAs, because the LUTs are different: Virtex-5 LUTs have six inputs, whereas Spartan-3 LUTs have four inputs.

Faster CoreConnect Bus

CoreConnect is IBM’s on-chip bus for SoCs, introduced in 1999. (See MPR 7/12/99-03, “PowerPC 405GP Has CoreConnect Bus.”) Although IBM created CoreConnect mainly for its own Power Architecture processors, anyone can freely license CoreConnect as synthesizable intellectual property (IP), and it’s not specific to a particular CPU architecture. Over the past eight years, soft-IP vendors have made many of their licensable peripheral cores compatible with CoreConnect. The only on-chip bus supported more widely is ARM’s AMBA.

For greater efficiency, CoreConnect places low-speed and high-speed peripherals on separate buses joined by a bridge. Until now, MicroBlaze supported only the slower On-chip Peripheral Bus (OPB), which has a 32-bit-wide datapath. MicroBlaze v7 still supports the OPB for backward compatibility but adds support for the faster Processor Local Bus (PLB). The PLB’s datapaths are configurable during synthesis for 32-, 64-, or 128-bit widths. Bus bandwidth depends on the width of these datapaths and the FPGA’s clock frequency, which can reach 550MHz in the fastest Virtex-5 devices. At 550MHz, the maximum theoretical bandwidth of a 128-bit PLB would be 8.8GB/s.

The PLB connects directly to the CPU and provides a multidrop bus shared by several on-chip peripherals. It supplements a proprietary Xilinx interface called the Fast Simplex Link (FSL). An FSL is a direct point-to-point interface, not a multidrop bus. FSLs are faster than shared buses but require more logic gates per I/O interface. An SoC design can use one or more FSLs in combination with a CoreConnect PLB for different purposes, giving developers a wealth of options.

Figure 1 shows an example of an SoC implemented in a Xilinx FPGA with a MicroBlaze or Power 405 processor core. In this example – a TCP/IP packet processor – the most critical datapaths link the Ethernet controller to the external-memory controller and the CPU. Those datapaths are FSLs, which can be 32 to 128 bits wide. Less critical components share a CoreConnect PLB. The Xilinx development tools can automatically configure an FSL for a particular purpose or allow developers to configure the interface manually.

With IBM’s blessings, Xilinx has slightly modified the standard CoreConnect IP. These modifications were necessary because the programmable-logic gates of FPGAs aren’t as efficient as the standard-cell gates of ASICs, especially when routing signals across a large chip. Datapaths and clock trees tend to stretch much longer in FPGAs, making timing closure more difficult. The problem gets worse in complex designs that distribute numerous peripherals throughout the chip on a shared bus. Consequently, Xilinx has modified the PLB to be more synchronous and to eliminate indeterminate data bursts. Xilinx says these changes, though relatively minor, allow developers to attach 10 or 20 peripherals to the CoreConnect PLB without timing problems.

In addition, Xilinx has modified the PLB to work more efficiently with hardened transceivers built into some Virtex-5 FPGAs. These transceivers are a PCI Express endpoint and a trimode Ethernet media-access controller (TEMAC), which deliver much better performance than would equivalent soft-IP controllers synthesized in the fabric. The PCI Express endpoint is fully buffered and supports 1, 2, 4, or 8 lanes. The TEMAC transceiver supports 10Mb/s, 100Mb/s, and Gigabit Ethernet rates. In a Xilinx benchmark test, a packet processor based on MicroBlaze v7 and the TEMAC achieved 750Mb/s raw throughput – an impressive 75% utilization of the transceiver’s maximum theoretical bandwidth.
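The peak bandwidth of a bus such as the PLB is its datapath width multiplied by its clock rate. A minimal sketch of that arithmetic, assuming one transfer per clock cycle and ignoring arbitration and protocol overhead (the function name is illustrative):

```python
def bus_bandwidth_gb_s(width_bits, clock_mhz):
    """Peak bandwidth in GB/s for a bus moving width_bits every clock cycle."""
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 / 1e9


if __name__ == "__main__":
    # The three PLB datapath widths at the 550 MHz Virtex-5 maximum
    for width in (32, 64, 128):
        print(f"{width}-bit PLB: {bus_bandwidth_gb_s(width, 550):.1f} GB/s")
```

For a 128-bit datapath at 550MHz this works out to 8.8GB/s, with 4.4GB/s and 2.2GB/s for the 64- and 32-bit widths.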

Although AMBA enjoys wider support than CoreConnect does, the latter standard is better for MicroBlaze. The Power 405 processor built into some Virtex family FPGAs uses CoreConnect, so it’s easier for developers to create asymmetric multiprocessors around the Power 405 and MicroBlaze cores. Many peripheral-IP cores from third-party vendors work with either AMBA or CoreConnect, usually by adding a simple gasket adapter.

New Instructions Boost Performance

Xilinx has added 11 new instructions to MicroBlaze v7: three for floating-point operations, and eight for use with FSLs. The new floating-point instructions are straightforward. One instruction, FSQRT, calculates the square root of a 32-bit floating-point value in either 27 or 29 clock cycles, depending on whether the MicroBlaze processor is configured with a three- or five-stage pipeline. (The deeper pipeline is faster.) Without using FSQRT, the same operation performed by a function call to a software library would take about 500 cycles.

The other two new floating-point instructions convert integers to floats or vice versa. The FLT instruction converts a 32-bit integer into a 32-bit float in four or six clock cycles, depending on the pipeline depth. Calling the same function in software would take 330 cycles. Conversely, the FINT instruction converts a 32-bit float into a 32-bit integer in five or seven clock cycles, depending on the pipeline depth. A software function call would take 88 cycles.
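The cycle counts above translate directly into speedups over the software library calls. The sketch below tabulates them; pairing the lower cycle count with the five-stage pipeline follows the article’s remark that the deeper pipeline is faster, and the table layout and function name are illustrative.

```python
# instruction: (cycles with 5-stage pipeline, cycles with 3-stage pipeline,
#               cycles for the equivalent software library call)
CYCLES = {
    "FSQRT": (27, 29, 500),
    "FLT": (4, 6, 330),
    "FINT": (5, 7, 88),
}


def speedup(instruction, five_stage=True):
    """Ratio of software cycles to hardware cycles for one instruction."""
    hw5, hw3, software = CYCLES[instruction]
    return software / (hw5 if five_stage else hw3)


if __name__ == "__main__":
    for name in CYCLES:
        print(f"{name}: {speedup(name, True):.1f}x (5-stage), "
              f"{speedup(name, False):.1f}x (3-stage)")
```

The integer-to-float conversion benefits most, because the software routine is long relative to the handful of cycles the FLT instruction needs.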

Eight new instructions improve I/O when a coprocessor is connected to a MicroBlaze core via an FSL. These instructions take the form of PUT and GET operations, and they allow programs to manage the coprocessor I/O as blocking or non-blocking operations. In a blocking relationship, the CPU stops processing other operations until it handles a coprocessor’s request for attention. In a non-blocking relationship, the CPU continues processing other operations while the coprocessor’s requests enter a FIFO buffer. The CPU isn’t blocked unless the buffer fills. Developers can configure the size of the buffer according to the coprocessor’s needs.
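The effect of buffer depth on stalling can be seen in a toy model. The Python sketch below simulates a fixed-depth FIFO between a producer and a consumer; it is a conceptual illustration only, and the class, method names, and timing model are hypothetical rather than a description of the actual MicroBlaze PUT and GET semantics.

```python
from collections import deque


class FslFifo:
    """Toy model of a fixed-depth FIFO between a CPU and a coprocessor."""

    def __init__(self, depth):
        self.depth = depth
        self.words = deque()

    def nput(self, word):
        """Non-blocking put: returns False without writing if the FIFO is full."""
        if len(self.words) >= self.depth:
            return False
        self.words.append(word)
        return True

    def nget(self):
        """Non-blocking get: returns (ok, word); ok is False if the FIFO is empty."""
        if not self.words:
            return False, None
        return True, self.words.popleft()


def blocking_put_stalls(fifo, words, drain_every):
    """Count cycles a blocking producer waits while the consumer removes
    one word every `drain_every` cycles."""
    stalls = 0
    cycle = 0
    for word in words:
        while not fifo.nput(word):
            stalls += 1          # producer waits one cycle for space
            cycle += 1
            if cycle % drain_every == 0:
                fifo.nget()      # consumer removes one word
        cycle += 1               # the successful put takes one cycle
        if cycle % drain_every == 0:
            fifo.nget()
    return stalls


if __name__ == "__main__":
    data = [1, 2, 3, 4]
    for depth in (2, 4):
        print(depth, blocking_put_stalls(FslFifo(depth), data, drain_every=3))
```

In this model, a deeper buffer absorbs the burst and the producer never waits, while a shallow one forces it to stall until the consumer catches up.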

In addition, MicroBlaze v7 can accommodate twice as many FSL interfaces as before (16 vs. 8), and programs can dynamically assign coprocessors to individual FSL interfaces at run time. Previously, coprocessor assignments to FSLs were hard-coded into the user’s software. Any changes required developers to recompile the software. Dynamic assignments allow developers to write software that adapts to changing conditions and workloads. For example, a developer could create precompiled software libraries that select their appropriate coprocessors at run time, according to the custom hardware in the coprocessors and the tasks to be performed. Multimedia-acceleration libraries could run on coprocessors specializing in fast Fourier transforms (FFT) or finite impulse-response (FIR) filters.

Figure 1 – Example SoC block diagram. This packet processor uses Xilinx Fast Simplex Links (FSL) for critical datapaths and a shared IBM CoreConnect Processor Local Bus (PLB) for other on-chip peripherals. MicroBlaze v7 is the first version of the core to support the CoreConnect PLB; previous MicroBlaze cores supported only the slower CoreConnect On-chip Peripheral Bus (OPB). The multiport memory controller is special Xilinx IP with built-in DMA; it’s compatible with DDR1 and DDR2 SDRAM.
[Blocks shown within a Virtex or Spartan FPGA: MicroBlaze or PowerPC; local memory; custom coprocessors; debug; multiport memory controller; DMA; Ethernet MAC; PCIe; PLBv46; general peripheral controller; interrupt controller; timer/PWM; UART; I2C/SPI; GPIO; CAN/MOST; USB 2.0; custom I/O peripherals.]

Price & Availability

MicroBlaze v7 is available now as part of Xilinx Embedded Development Kit 9.2, for $495. The kit includes MicroBlaze configuration tools, Eclipse-based software-development tools, and other software, plus documentation. For $595, the kit comes with a development board, a Spartan-3E 1600E FPGA, and a JTAG probe. Xilinx doesn’t charge royalties for deploying MicroBlaze designs in Xilinx FPGAs. For more information, see: www.xilinx.com/microblaze

Table 2 shows the results of porting a packet-processor design from MicroBlaze v6 to v7 – throughput improved more than 3x, from about 70Mb/s to more than 250Mb/s. However, notice that this comparison (conducted by Xilinx) doesn’t isolate the effect of each variable changed in the design. In particular, the Ethernet controller in the upgraded design is much faster. Nevertheless, the comparison demonstrates what is possible. Increasing the maximum theoretical bandwidth of a system doesn’t guarantee higher throughput, and one feature of MicroBlaze v7 is better CoreConnect support for the TEMAC hard-wired into Virtex-5 LXT chips.
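The Table 2 numbers can be read two ways: as an improvement ratio and as a fraction of each design’s Ethernet line rate. The sketch below computes both. Because the article gives the throughputs as “about 70” and “more than 250” Mb/s, the results are approximate; the names in the code are illustrative.

```python
# Throughput and maximum Ethernet line rate in Mb/s, from Table 2
DESIGNS = {
    "MicroBlaze v6": {"throughput": 70, "line_rate": 100},
    "MicroBlaze v7": {"throughput": 250, "line_rate": 1000},
}


def improvement(old_name, new_name):
    """Throughput ratio between two designs."""
    return DESIGNS[new_name]["throughput"] / DESIGNS[old_name]["throughput"]


def utilization_pct(name):
    """Throughput as a percentage of the design's Ethernet line rate."""
    design = DESIGNS[name]
    return 100.0 * design["throughput"] / design["line_rate"]


if __name__ == "__main__":
    print(f"Improvement: {improvement('MicroBlaze v6', 'MicroBlaze v7'):.2f}x")
    for name in DESIGNS:
        print(f"{name}: {utilization_pct(name):.0f}% of line rate")
```

Although absolute throughput more than triples, the v7 design uses about a quarter of its gigabit line rate, whereas the v6 design used about 70% of Fast Ethernet, which is why a faster link alone does not guarantee proportionally higher throughput.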

New Kid on the Block: ARM

MicroBlaze v7 is the second new version of the processor that Xilinx has introduced this year, and the third since 2006. Although these steps have been incremental, they add up to a significantly better processor. The quickening pace of improvement may be due to the arrival of fresh competition: ARM’s Cortex-M1.

The Cortex-M1 is the first ARM processor core sanctioned for deployment in FPGAs and optimized for their programmable-logic fabrics. Previously, ARM allowed licensees to test their designs in FPGAs but not to deliver finished designs in the chips. ARM’s course change was prompted partly by the rising costs of designing and manufacturing ASICs, and partly by the popularity of the Xilinx MicroBlaze and Altera Nios II cores. (Xilinx and Altera have sold tens of thousands of licenses for their processors.) The Cortex-M1 is a major new development that alters the competitive landscape. (See MPR 3/19/07-01, “ARM Blesses FPGAs.”)

The first FPGA vendor to announce support for the Cortex-M1 was Actel, a much smaller company than Altera or Xilinx. Actel has a special arrangement with ARM to sell Cortex-M1 FPGAs without requiring customers to purchase an ARM license or pay chip royalties to ARM. This deal dramatically lowers the cost of deploying an ARM-based design. Xilinx hasn’t announced a similar arrangement yet, but MPR suspects it’s a possibility in the future. Although the Cortex-M1 and MicroBlaze processors would seem to make rivals of ARM and Xilinx, their relationship remains more cooperative than competitive. ARM understands that MicroBlaze is primarily a loss-leader product that Xilinx created to sell more FPGAs. A MicroBlaze v7 license costs only $495, so the chips – not the licenses – are the real moneymakers. Xilinx is happy to see customers buying its FPGAs to use with the Cortex-M1, too.

Even so, while ARM and Xilinx cordially shake hands, MicroBlaze v7 is slapping the Cortex-M1 upside the head. The brand-new ARM processor suffers in comparison with the Xilinx loss leader. Although MicroBlaze v7 is priced at a pittance, it’s embarrassingly rich in features missing from the Cortex-M1, such as an optional FPU, MMU/MPU, 32-bit divider, and instruction/data caches. On top of that, MicroBlaze can reach higher clock frequencies than the Cortex-M1 can. ARM’s biggest selling point is that the Cortex-M1 is from ARM. The ARM architecture is almost an industry standard, and it’s supported by tons of peripheral IP, development tools, and software.

As Table 3 shows, Altera’s Nios II is a closer match for MicroBlaze, even though it hasn’t been significantly upgraded since 2004. (See MPR 6/28/04-02, “Altera’s New CPU for FPGAs.”) The addition of an optional MMU/MPU gives MicroBlaze v7 its first big advantage over Nios II. However, Altera retains one advantage: a user-configurable instruction-set architecture. Nios II developers can create custom instructions to accelerate specific applications, which can dramatically improve performance. To achieve similar results, MicroBlaze developers can implement coprocessors in the programmable-logic fabric. (Coinciding with the online publication of this article on November 13, Altera is announcing an arrangement with Synopsys that will make it easier for developers to move Nios II-based designs from FPGAs to standard-cell ASICs. MPR plans to cover this development in the future.)
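One way to compare the cores in Table 3 on a common footing is Dhrystone MIPS per MHz, the unit in which the Cortex-M1 figure (0.8 Dmips/MHz) is already given. The sketch below derives it for the other cores from their listed maximum clock and maximum integer performance. It assumes both maxima refer to the same configuration, which the table does not state, so treat the results as rough; the names in the code are illustrative.

```python
# name: (maximum core clock in MHz, maximum integer performance in Dmips),
# as listed in Table 3
CORES = {
    "MicroBlaze v7": (220, 240),
    "Nios II/f": (205, 225),
    "Nios II/s": (165, 127),
    "Nios II/e": (200, 31),
}


def dmips_per_mhz(name):
    """Dhrystone MIPS per MHz, assuming both maxima apply to one configuration."""
    clock_mhz, dmips = CORES[name]
    return dmips / clock_mhz


if __name__ == "__main__":
    for name in CORES:
        print(f"{name}: {dmips_per_mhz(name):.2f} Dmips/MHz")
```

By this measure MicroBlaze v7 and Nios II/f are nearly identical at about 1.1 Dmips/MHz, which supports the observation that Nios II is the closer match.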

Note that the price gaps among these processors are shrinking. Before the Cortex-M1, the difference between licensing a processor core from an FPGA vendor or licensing one from ARM was four orders of magnitude: about $500 for a MicroBlaze or Nios II versus millions of dollars for an ARM. With the Cortex-M1, ARM has departed from its long-standing practice of not publicly disclosing licensing fees. Although the exact price for a Cortex-M1 license is still under wraps, ARM says it will cut deals for less than $100,000 – a huge price break. And, as described above, Actel sells preconfigured Cortex-M1 FPGAs without requiring an ARM license at all. With Altera and Xilinx practically giving away their FPGA-ready processors, ARM had to modify its own licensing model to make the Cortex-M1 competitive.

| Feature | MicroBlaze v6 (Embedded Dev Kit 9.1) | MicroBlaze v7 (Embedded Dev Kit 9.2) |
|---|---|---|
| TCP/IP Stack | LwIP | Treck |
| On-Chip Interconnect | CoreConnect On-chip Peripheral Bus (OPB) | CoreConnect Processor Local Bus (PLB v4.6) |
| External Memory Controller | Xilinx MCH_OPB_DDR | Xilinx Multiport Memory Controller |
| Ethernet MAC | Fast Ethernet (10–100 Mb/s) | Gigabit Ethernet Trimode MAC (10–100–1,000 Mb/s) |
| Xilinx FPGA | Virtex-5 LX | Virtex-5 LXT |
| Packet Throughput | ~70 Mb/s | >250 Mb/s |

Table 2 – MicroBlaze v6 versus MicroBlaze v7 performance. This comparison is based on a packet-processor design pitting the new MicroBlaze core against the older version. The upgraded design is more than three times faster. However, Xilinx also changed other variables, including the TCP/IP stack, memory controller, and Ethernet controller, which clouds the comparison somewhat. In particular, the network interface leaps from a soft-IP Fast Ethernet MAC (100Mb/s) to a hard-wired Gigabit Ethernet Trimode MAC (10–100–1,000Mb/s).

| Feature | Xilinx MicroBlaze v7.0 | Xilinx MicroBlaze v6.0 | Altera Nios II/f | Altera Nios II/s | Altera Nios II/e | ARM Cortex-M1 |
|---|---|---|---|---|---|---|
| Architecture | MicroBlaze | MicroBlaze | Nios II | Nios II | Nios II | ARMv6-M |
| Primary FPGA Targets | Virtex-5, Spartan-3 | Virtex-5, Spartan-3 | Stratix, Cyclone, HardCopy | Stratix, Cyclone, HardCopy | Stratix, Cyclone, HardCopy | Fusion, ProASIC-3, Stratix, Virtex-4/5, Cyclone, Spartan |
| Config. ISA | – | – | Yes | Yes | Yes | – |
| Pipeline Depth | 3 or 5 stages | 3 or 5 stages | 6 stages | 5 stages | 1 stage* | 3 stages |
| I-Cache | 0–64K | 0–64K | 0–64K | 0–64K | – | – |
| D-Cache | 0–64K | 0–64K | 0–64K | 0–64K | – | – |
| Local Memory | 0 or 2, 256K each | 0 or 2, 256K each | 0–8, configurable | 0–4, configurable | – | 0 or 2, 1K–1,024K each |
| 32-Bit Multiplier | Optional | Optional | Optional | Optional | – | Yes, 2 options |
| 32-Bit Divider | Optional | Optional | Optional | Optional | – | – |
| Barrel Shifter | Optional | Optional | Optional | Optional | – | Yes |
| FPU | Optional, 32 bits (new instructions) | Optional, 32 bits | Optional, 32 bits | Optional, 32 bits | Optional, 32 bits | – |
| Branch Predict | – | – | Dynamic | Static | – | – |
| Priv. Levels | 1 or 2 | 1 | 2 | 2 | 2 | 1 |
| Coprocessor Interface | FSL (new instructions) | FSL | Avalon-MM | Avalon-MM | Avalon-MM | AMBA-Lite |
| On-Chip Interconnect | CoreConnect PLB v4.6 | CoreConnect OPB | Avalon-MM | Avalon-MM | Avalon-MM | AMBA-Lite |
| Memory Management | Optional MPU or MMU | – | – | – | – | – |
| Translation Lookaside Buffer (TLB) | Optional; 8-entry I + D, 64-entry unified | – | – | – | – | – |
| Core Freq (Max) | 220 MHz † | 220 MHz † | 205 MHz | 165 MHz | 200 MHz | Up to 72MHz ‡; >170MHz ** |
| Int. Perf (Max) | 240 Dmips | 240 Dmips | 225 Dmips | 127 Dmips | 31 Dmips | 0.8 Dmips/MHz |
| FP Perf (Max) | 50 MFLOPS | 50 MFLOPS | n/a | n/a | n/a | n/a |
| FPGA Logic Cells | 980–3,000 | 960–1,700 | 1,800 | 1,150 | 600 | 4,300+ LUT3 tiles; ~1,900 LUT4 cells |
| Introduction | Nov 2007 | Mar 2007 | 2004 | 2004 | 2004 | 4Q07 |
| Price | $495 | $495 | $495 | $495 | $495 | <$100,000 (ARM); Free (Actel) |

Table 3 – Feature comparison of the Xilinx MicroBlaze v7, MicroBlaze v6, Altera Nios II, and ARM Cortex-M1 processor cores. All are 32-bit embedded processors designed and optimized for synthesis in FPGAs. Altera’s Nios II processor is available in three basic configurations that developers can customize. Key differences between the MicroBlaze v7 and v6 are highlighted in purple text. With its new optional MMU/MPU, MicroBlaze v7 gains an important advantage over Nios II; the Cortex-M1 fares poorly in this comparison but is redeemed by a more familiar CPU architecture. (FSL is the Xilinx Fast Simplex Link. Avalon-MM is Altera’s memory-mapped interface.) *Nios II/e has a six-stage pipeline, but it works like a one-stage pipe. †Maximum clock speed assumes synthesis in the fastest Xilinx Virtex-5 FPGAs. ‡Estimate for synthesis in an Actel ProASIC3 or Fusion FPGA. **Estimate for synthesis in a Xilinx Virtex-5 FPGA. (n/a: not applicable)

The Future of CPUs for FPGAs

As FPGA prices fall, and the costs of ASICs rise, deploying an SoC in programmable logic becomes increasingly attractive. As MPR has noted in the past, the production volume at which an FPGA implementation becomes more economical than an ASIC implementation keeps tilting in favor of FPGAs, and we perceive nothing on the horizon that will alter that trend. For that reason, the future of MicroBlaze (and Nios II) looks bright.
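The break-even argument can be made concrete with a simple cost model. The sketch below is illustrative only: it assumes total cost is a one-time NRE charge plus a constant per-unit cost, and every dollar figure is hypothetical rather than taken from MPR or any vendor.

```python
import math

def break_even_volume(asic_nre, asic_unit_cost, fpga_nre, fpga_unit_cost):
    """Smallest volume at which the ASIC's total cost is strictly below the FPGA's.

    Assumed model: total cost = NRE + volume * unit_cost.
    Returns None if the ASIC never becomes cheaper.
    """
    unit_saving = fpga_unit_cost - asic_unit_cost  # saved per unit by using the ASIC
    extra_nre = asic_nre - fpga_nre                # additional up-front cost of the ASIC
    if unit_saving <= 0:
        return None
    return max(0, math.floor(extra_nre / unit_saving) + 1)

# Hypothetical figures: $1M ASIC NRE, $50K FPGA NRE, $5 ASIC unit cost.
at_25_dollars = break_even_volume(1_000_000, 5, 50_000, 25)  # FPGA at $25 per unit
at_15_dollars = break_even_volume(1_000_000, 5, 50_000, 15)  # FPGA at $15 per unit
```

Under these assumed numbers, cutting the FPGA unit price from $25 to $15 roughly doubles the volume needed before an ASIC pays off, which is the sense in which falling FPGA prices tilt the balance.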

However, one thing that could change is the CPU architecture that developers choose to implement in an FPGA. For now, MicroBlaze and Nios II are by far the most popular choices, because they are promoted by their respective FPGA vendors and are practically free. ARM’s Cortex-M1 changes the equation by introducing the option of using the industry’s most popular 32-bit embedded-processor architecture. In time, other embedded-processor architectures may join the fray. At present, ARC, MIPS Technologies, and Tensilica don’t necessarily forbid licensees from deploying designs in FPGAs, but they don’t encourage it, either, and they have not optimized their processors for programmable logic.

If other CPU architectures do become available for FPGAs, at affordable prices, the time may come when the FPGA vendors’ own CPU architectures look less attractive. Although Altera and Xilinx have sold many more CPU licenses than even ARM ever will, a great number of those licenses are purchased by students and tinkerers who never intend to mass-produce their designs. Companies seriously intending to produce FPGA-based SoCs in volume may prefer to use a more widely supported CPU architecture. MicroBlaze and Nios II could become interesting footnotes in the history of microprocessors.

Even if that happens, the Altera and Xilinx processors will have served their purposes. They are selling more FPGAs, they are helping to seed the market for FPGA-based SoCs, and they are defining the features and optimizations that FPGA-specific processors should have. Whether or not MicroBlaze and Nios II live long and prosper, they are wise investments for their vendors.


by Maciej Halasz
Senior Product [email protected]

Many of you have built a Linux platform on top of Xilinx® processors using PowerPC-based Virtex™ FPGAs. Developers are choosing Linux more often, and for a variety of reasons. The most appealing aspect of Linux is that it is free to end users and does not require any royalties or run-time license payments to an OS vendor.

There are, however, a number of challenges that drive up the cost and risk of adopting Linux. In this article, I’ll outline those challenges and discuss how you can leverage LinuxLink by TimeSys when developing on Xilinx platforms to lower your cost and risk of failure and shorten the project timeframe.

The Challenge

Many of you choose to build a Linux platform on your own, spending substantial amounts of time “hunting and gathering” the right Linux components for your project, irrespective of your experience level.

Assuming that you are successful in gathering the right components, the next challenge lies in the aggregation and cross-compilation of all of these pieces. Unfortunately, this aggregation process can be complicated, particularly in an embedded environment where Linux components need to work with an embedded processor. This process is unpredictable and very error-prone, sharply increasing project risks. All of this work to assemble a Linux platform for an application does not add real value to the project. The value comes from applications built on top of the Linux platform.

What is My Task?

There are probably several answers to this question. At a high level, most of you want to build a Linux platform that works on custom hardware, with a value-added application on top (see Figure 1).

The hardware design is central to the project; once such a design is in place, the next step is to get a Linux kernel running on it. Depending on your specific requirements, you must change the Linux kernel to match the hardware design. Sometimes you have to develop new device drivers. This task is supported today to some degree by Xilinx EDK tools.

The next task is to assemble a root filesystem that includes the functionality required by the end application running on your custom hardware design. Assembling a root filesystem can be a very involved and time-consuming process that grows in time and risk level with the number of packages that you need to include.
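One reason the effort grows with the package count is that each package must be built after everything it depends on. As a rough sketch of that ordering step (not a description of how LinuxLink or any particular build system works), the example below uses Python's standard `graphlib` module. The dependency table is hypothetical and only illustrates the idea.

```python
from graphlib import TopologicalSorter

# Hypothetical root filesystem packages mapped to their build-time dependencies.
ROOTFS_PACKAGES = {
    "glibc": set(),
    "busybox": {"glibc"},
    "zlib": {"glibc"},
    "openssl": {"glibc", "zlib"},
    "web-server": {"openssl", "zlib"},
}

def build_order(packages):
    """Return an order in which every package follows all of its dependencies.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(packages).static_order())

order = build_order(ROOTFS_PACKAGES)
```

Several valid orders usually exist. The hard part in practice is getting the dependency table right for a cross-compiled target, which is the work the article describes as error-prone.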


Building Linux Platforms on Xilinx Processors with LinuxLink

Find out how a LinuxLink subscription accelerates your Linux development and mitigates project risks.


Once a Linux kernel and a root filesystem are running on the custom hardware, you can focus on developing value-added applications. If the application requirements are well understood before the project begins, this stage of the project ends here, with aggregation of all Linux pieces into deployable images.

However, application requirements often change during the project cycle. For Linux, this means not only application changes but also root filesystem changes. Sometimes even the Linux kernel has to be adjusted.

What Do I Need?

You will need several components to develop an embedded software platform (see Figure 2).

For any Linux-based project, Linux components are developed and/or adopted from already existing sources (open-source projects). Depending on the specific application and how complex the hardware design is, the Linux platform development process can take anywhere from weeks to months, even though there are tools and information online that you can leverage in the process. Tools, libraries, and various utilities are definitely needed to develop, debug, and test the Linux solution throughout the entire project cycle.

To accelerate development processes while minimizing the risk of failure, you can use highly integrated development tools and prebuilt Linux components from a trusted source.

Development tools that run on the host development platform, combined with your experience and knowledge, will largely determine how much time and effort you spend on a project.

Figure 1 – Linux project components (Xilinx FPGA hardware design + Linux kernel + root filesystem = Linux platform; Linux platform + value-added application = final solution)

Figure 2 – Host and target components needed in a Linux project (host development environment: cross-toolchains, command-line and graphical development tools, debuggers, profilers, tracers, patching, and configuration; target development components: application, root filesystem with basic and specialized packages, Linux kernel and drivers)

Linux Development Process

The Linux project development cycle, as shown in Figure 3, can be divided into several stages. Some of the tasks in the cycle (stages 3 and 4) are executed in parallel; others are sequential (stages 5 and 6).

The process of developing Linux-based systems on a Virtex platform starts at the hardware level, when you select a set of hardware blocks (IPs) needed to support your application. Part of the reason why FPGAs are so popular is the flexibility they offer in assembling the hardware platform. That flexibility translates directly into how LinuxLink supports you in your efforts in getting a Linux solution in place, in a timely fashion and with minimal risk.

Linux Kernel Choice

The Linux kernel code is just like other software: it evolves to meet the needs of new designs and applications. Each kernel release, occurring on average three to four times a year, introduces new updates, fixes, and support for ever-growing sets of drivers. For this reason alone, the latest version of the Linux kernel is the preferred choice for many Linux projects.

Root Filesystem Challenges

The number of open-source projects that provide filesystem functionality grows every day, providing you with access to hundreds of Linux components that you can adopt in your project. This is both good and bad for the embedded Linux application. The wide selection of Linux components available from the open-source community provides platform support for all kinds of applications.

On the negative side, that same breadth causes headaches, as it is very difficult to find the packages that work as expected or packages that work well together.

It can take a lot of time and effort to find a good set of packages, libraries, and utilities that properly support application-level development. Certainly there are some starting points available in the open-source community, but they all bear the same challenge: they are not easy to customize for speed, functionality, and the footprint required by the final deployment image.

The process of building a Linux platform based on open-source information and components is not a problem for a very knowledgeable engineer, although it still bears a high risk of failure.

How LinuxLink Accelerates Development

LinuxLink is a subscription-based system that provides embedded engineers with online access to hundreds of Linux components optimized for the Virtex platform, all of which have been tested to work with each other.

The subscription provides you with access to preassembled reference distributions optimized for different applications that include the following:

• EDK studio project files for specific hardware configurations

• Binary Linux kernel

• Cross-toolchain with platform libraries

• Development-ready root filesystem

• Sources

Linux components available through LinuxLink are regularly updated, providing you with the latest code available for both Linux kernel and platform packages.

The reference distribution is optimized to achieve two distinct goals:

1. Set up the platform and application development environment for the target Virtex platform on your host machine

2. Get the Linux kernel up and running in a matter of minutes on a Xilinx reference board (such as the ML405)

After installation of the reference distribution, you can develop, in parallel, both value-added applications and the software platform on which those applications will execute (blocks 3 and 4 in Figure 3).

You can easily assemble the Linux platform with pre-built, ready-to-use packages like “alsa” sound libraries or video codecs. You can immediately download these ready-to-use packages to the host development system with the command line or Eclipse-based TimeStorm tools, where they can be aggregated into a complete and functional root filesystem. These filesystems are ready for deployment in a variety of media, including on-board NAND flash or Compact Flash cards.

Application development is well supported by the same command line and graphical development tools used for platform development. The set of architecture-specific cross-toolchains provided with a LinuxLink subscription comes pre-installed with the libraries frequently needed for application development. This makes your task of building value-adding applications seamless. The development environment set up by LinuxLink is always kept in sync with target binaries. Libraries that are available on the host for application development are also available in the form of ready-to-deploy, on-target packages for the platform design.

Finally, a LinuxLink subscription provides access to tools that are indispensable in the development of an embedded Linux solution. You can download tools from the LinuxLink site such as debuggers for both system and application-level debugging, profilers for code optimizations, and tracing tools. The subscription comes with ticket-based engineering support staffed with embedded engineers with years of experience.

With a LinuxLink subscription, you get help throughout the entire development cycle (Figure 3). This means that you can get your job done much faster and in a more controlled environment.

Conclusion

LinuxLink is designed to support you as you assemble a custom Linux platform. With prebuilt, ready-to-run Linux components and support, you need not spend valuable time building out a commoditized open-source platform. You can instead focus on the development of value-added applications, along with integration, testing, and optimization. You can then introduce products to the market much faster, at higher quality and lower cost.

LinuxLink supports Virtex-4 and Virtex-5 FPGAs with the latest Linux kernels. With a LinuxLink subscription, you can choose from the latest continuously updated Linux components for your Xilinx FPGA, including the newest Linux kernel and other Linux component versions.

To find out more about how you can use the software, tools, and support provided with a LinuxLink subscription to accelerate your next embedded Linux project, visit www.timesys.com. You can browse the content of the LinuxLink repository to see everything available to subscribers.

Figure 3 – Tasks in the Linux project development cycle (1: development environment setup; 2: gathering of needed Linux components; 3: Linux platform deployment; 4: application development; 5: testing, debugging, optimization; 6: integration, deployment)


Xilinx Productivity Advantage. THE ADVANTAGE IS ALL YOURS!

The Xilinx Productivity Advantage (XPA) program provides all your FPGA design needs up-front where and when you need them. This one-stop, single-vendor solution allows you to create customized bundles of software and services to meet your specific needs.

The Xilinx Productivity Advantage program gives you the competitive advantage by:

• Delivering cost savings in reduced speed grades

• Decreasing your time to market

• Decreasing the gap between consumable and consumed technology

Design Faster and More Cost-Effectively

Each XPA solution bundle includes your user licenses for ISE™, the industry’s leading design tools and development system for FPGAs. Pre-verified and pre-optimized IP cores and reference designs. Customized development and evaluation boards. And training credits toward the most comprehensive portfolio of expert Education Services. The Xilinx Productivity Advantage program: The Advantage is All Yours!

To learn more about Xilinx Services and the Xilinx Productivity Advantage, we invite you to visit: www.xilinx.com/xpa

©2008 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.


Figure 1 diagram labels: Mobius; Mobius TLM Simulate; Mobius Hardware Generate; Mobius Software Generate; HDL; C; HDL Simulate; Hardware Synthesis; Netlist Simulate; Software Compile; Software Simulate; Bitstream; Elf; Target

by Per Ljung
President
Codetronix [email protected]

Many applications use a distributed architecture, including multi-FPGA systems and FPGA accelerators for microprocessors. Creating master-slave or peer-to-peer distributed systems is complex and requires a design methodology that supports efficient partitioning, as well as an efficient implementation for communication and synchronization.

In this article, I’ll present the Mobius tools and work flow for distributed systems. Mobius is a high-level language that can generate both efficient hardware (Verilog, VHDL) and software (C with scheduler) from a single source. Mobius uses handshaking for multi-threaded synchronization and communication, resulting in target-independent, latency-insensitive handshaking processes. Mobius is suitable for both control- and data-dominated applications. Figure 1 shows how Mobius is a front-end to the Xilinx® Platform Studio tool.


Quickly Build Distributed Systems

Mobius generates both hardware and software for XPS-based projects.

Figure 1 – Mobius is a front-end to Xilinx Platform Studio tools.


The micro-architecture is user-specified using a WYSIWYG syntax; for example, you specify which tasks run sequentially and which run in parallel. Parallel tasks synchronize and communicate using message passing over unidirectional point-to-point blocking channels. Because both the reader and writer wait for the other to be ready, the channels provide a robust bridge between hardware and software subsystems. You can compile a Mobius source to hardware, software or any arbitrary combination, allowing for exceptional flexibility when partitioning.
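The blocking-channel behavior can be imitated in ordinary software. The following is a minimal Python sketch, not Mobius syntax and not code that Mobius generates. The `Channel` class and `run_pipeline` helper are hypothetical names, and the sketch assumes exactly one writer and one reader per channel.

```python
import queue
import threading

class Channel:
    """Unidirectional point-to-point channel with blocking (rendezvous) semantics."""

    def __init__(self):
        self._slot = queue.Queue(maxsize=1)

    def write(self, value):
        self._slot.put(value)  # hand the value over
        self._slot.join()      # block until the reader has taken it

    def read(self):
        value = self._slot.get()  # block until the writer has a value ready
        self._slot.task_done()    # release the waiting writer
        return value

def run_pipeline(values):
    """Run a producer and a consumer in parallel, connected by one channel."""
    chan = Channel()
    results = []

    def producer():
        for v in values:
            chan.write(v)

    def consumer():
        for _ in values:
            results.append(chan.read() ** 2)

    tasks = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()
    return results

squares = run_pipeline([1, 2, 3])
```

Because neither side proceeds until the other is ready, the result does not depend on how fast either task runs, which is the property that makes such channels a convenient boundary between hardware and software.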

An ESL tool, Mobius generates software for the MicroBlaze embedded processor, circuits for FPGA logic, and bridges between the software and hardware domains. A Xilinx Platform Studio project is automatically generated with Mobius-generated software and hardware, which further increases productivity. Using Mobius, you can design and implement a complex co-design within an hour.

Mobius Language and Tools

Codetronix designed the Mobius language, a tiny high-level multi-threaded language, to be simple to learn and to enable efficient implementations. A uniform programming model allows for target-independent specification and subsequent compilation to target-specific implementations.

The programming model uses the communicating sequential processes (CSP) methodology with blocking message passing. The Mobius language supports sequential or parallel statements, channels, timers, events, and arbitration, in addition to parameterized signed/unsigned integers and fixed- and floating-point (arithmetic, relational, transcendental) operators. As a result, the Mobius language offers many of the built-in capabilities of an RTOS.

The CSP methodology allows compile-time semantic checking of parallel constructs (such as the illegal simultaneous read and write of a variable) for faster development. The Mobius high-speed transaction-level simulator provides quick bit-accurate simulations, enabling functional verification with a test bench written in Mobius. The test bench is also compiled by Mobius, enabling quick verification of the generated hardware or software.

Benchmarks show that the quality of results for Mobius-generated software and hardware is excellent. For example, signed fixed-point FIR filters easily achieve 500 MHz on Virtex™-4 and Virtex™-5 device targets, and the commstime benchmark running on a MicroBlaze processor shows that Mobius is 66 times faster than a POSIX pthread implementation.

Handshaking Primitives and Channels

The Mobius compiler decomposes the high-level Mobius source into a handshaking graph of handshaking primitives (nodes) connected by handshaking channels (arcs). Some of the handshaking primitives are shown in Figure 2, and can be implemented in either hardware or software as a tiny finite state machine (FSM) reacting to the signals on any connected handshake channels. Because all control and dataflow is local, there are no global FSMs or wires.

Figure 2 – Target-independent handshaking primitives (timer, muxN, demuxN, seqN, parN, function, procedure, cast, unary op, binary op, repeat, const, var, channel, array, transfer, if-else, do, stop, skip)

Consider the Mobius-generated handshaking graph for the Euclid algorithm to compute the greatest common divisor (GCD) in Figure 3. A control channel is depicted as an arc terminated in a solid dot and a hollow dot. The solid dot denotes an active port that initiates the handshake, and the hollow dot denotes a passive port that waits for the handshake. A push or pull channel has arrow heads signifying the direction of data flow, where the arrow head may be hollow or solid, signifying an active or passive protocol.

Figure 3 – Handshake graph of GCD() (nodes include variables x and y; channels a, b, and c; operators >, !=, and -; and std_rep, std_seq, std_do, std_ifelse, std_transfer, var_mux, and var_demux primitives)
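The Mobius source for the GCD example is not reproduced in the article. Judging only from the node labels in Figure 3 (a loop, an if-else, the comparisons != and >, and two subtractions), the graph appears to correspond to the subtractive form of Euclid's algorithm. The Python sketch below is that reading, offered as an assumption and not as the author's code.

```python
def gcd(x, y):
    """Subtractive Euclid algorithm for positive integers x and y."""
    if x <= 0 or y <= 0:
        raise ValueError("inputs must be positive integers")
    while x != y:      # loop until both values agree
        if x > y:      # subtract the smaller value from the larger one
            x = x - y
        else:
            y = y - x
    return x

result = gcd(48, 18)
```

Each operator in the loop maps naturally onto one primitive in the handshake graph, which is why even this short function expands into a graph with more than a dozen nodes.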

Bridges between Hardware and Software

Channels are transport layer-independent and can use wires, SERDES, or packets as appropriate. You can insert bridges on any channel and allow full-featured support of legacy interconnects, including OCP-IP, Wishbone, PLB, FSL, PCI, Ethernet, or RocketIO™ transceivers. Similarly, channels allow domain crossing: between software and hardware, between different clock domains, or between different voltage domains. As a result, implementing distributed and hardware/software systems using handshaking is a straightforward process.

Consider a simple example where the GCD is implemented as a hardware circuit, while the test bench runs in C on a MicroBlaze processor. The Mobius compiler identifies that the reader and writer of each channel are in different domains and emits C for the software domain, HDL for the hardware domain, and FSL bridges to connect the hardware and software.

The Xilinx Fast Simplex Link (FSL) is a peer-to-peer, unidirectional, point-to-point synchronization and communication protocol using queues, which maps easily to Mobius channels. Figure 4 shows how the software writes two values to two FSL bridges and then reads a value from a third FSL bridge that is blocking until the output is available. Simultaneously, the hardware reads two values from its two input channels and computes a result that is written to its single output channel. An autostart() module generates a req pulse to start the Mobius-generated hardware on a system reset.
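The data flow in Figure 4 can be modeled on a host PC before touching any hardware. In the sketch below, three bounded Python queues stand in for the three FSL links and a thread stands in for the GCD circuit. The FIFO depth and all function names are assumptions made for illustration, and none of this is Mobius output or the FSL driver API.

```python
import math
import queue
import threading

FIFO_DEPTH = 16  # assumed queue depth; the real depth is a hardware configuration choice

def hardware_gcd(fsl_a, fsl_b, fsl_out, jobs):
    """Stand-in for the hardware side: read two inputs, write one result."""
    for _ in range(jobs):
        a = fsl_a.get()                 # blocks until software has written a value
        b = fsl_b.get()
        fsl_out.put(math.gcd(a, b))     # math.gcd stands in for the GCD circuit

def software_testbench(pairs):
    """Stand-in for the MicroBlaze side: write two values, then block on the result."""
    fsl_a, fsl_b, fsl_out = (queue.Queue(maxsize=FIFO_DEPTH) for _ in range(3))
    hw = threading.Thread(target=hardware_gcd,
                          args=(fsl_a, fsl_b, fsl_out, len(pairs)))
    hw.start()
    results = []
    for a, b in pairs:
        fsl_a.put(a)
        fsl_b.put(b)
        results.append(fsl_out.get())   # blocks until the output is available
    hw.join()
    return results

answers = software_testbench([(48, 18), (7, 13)])
```

The blocking read on the output queue plays the role of the third FSL bridge: the test bench simply waits until the result arrives, with no polling or shared state.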

Automatic XPS Projects

Assembling a complex hardware and software design in Xilinx Platform Studio can be challenging. To enable users to rapidly generate correct XPS projects, Codetronix provides a utility called xpscustom that adds Mobius-generated software applications and hardware IP to an existing Xilinx Platform Studio project.

As shown in Figure 5, xpscustom adds Mobius-generated hardware IP, bridges, and wire connections. On the software side, xpscustom automatically adds simple IP drivers, a software-only application, and a co-design application. This results in a push-button flow where you only have to compile the software and generate the bitstream for a distributed implementation.

Conclusion

Mobius allows you to rapidly develop high-quality hardware and software solutions. You can map Mobius-generated HDL and C to arbitrary voltage, clock, physical, or co-design domains and then connect these domains using synchronous channels, asynchronous channels, or bridges to build distributed embedded systems on any Xilinx FPGA. Mobius also compiles your test bench to hardware and software, allowing functional verification in both software and hardware implementations.

Users benefit from significantly higher design productivity, letting them generate and explore alternative microarchitectures to optimize their system. For more information about using Mobius in your next design, visit www.codetronix.com.

Figure 4 diagram labels: Hardware; Software; autostart; gcd; bridge_fsl_rchan (two instances); bridge_fsl_wchan

Figure 4 – Mobius-generated hardware and software connected by bridges

Figure 5 – Hardware and software integrated into a Xilinx Platform Studio project by xpscustom


by Jorge Carrillo
Staff Software Engineer
Xilinx, [email protected]

Koji Imanishi
Department Manager
Computex Co., [email protected]

Raj Nagarajan
Senior Software Engineer
Xilinx, [email protected]

Nobuhiro Nishiguchi
Superintending Engineer
Computex Co., [email protected]

Ayako Suzuki
Senior Engineer
Computex Co., [email protected]

More and more FPGA designs now include embedded processors (like PowerPC and Xilinx® MicroBlaze™ processors) to deal with control tasks that are easier to describe in a software language like C rather than in a hardware description language like VHDL or Verilog.

When designing embedded systems, you probably spend the vast majority of design time in the debugging phase. It is important to reduce the time associated with finding problems and solving them. When you do face design problems, it is important to use the right tools for the job, including a tool that allows you to identify problems quickly.


Debugging Systems with FPGA Embedded Processors

F-Sight improves debugging efficiency through easy-to-use, state-of-the-art features.


Computex F-Sight is an integrated debugger that provides both hardware and software debugging capabilities. On one side, it allows full software debugging of processors inside the FPGA. On the other side, it allows monitoring of FPGA hardware signals. In this article, we will describe how F-Sight can help improve your debugging efficiency.

Enabling the Debugger

Xilinx has enabled the features available in the Computex debugger for use on FPGA internal embedded processors. For MicroBlaze processors, you can use the MicroBlaze debugging module (MDM) to control and debug the processor execution. You can also use the Xilinx MicroBlaze trace core (XMTC) to monitor the processor program execution in a non-intrusive manner.

Because of pin limitations in FPGAs, it is important to reduce the number of signals that are brought out to pins. The XMTC provides encoded instruction and data traces with pin requirements only 10% of what unencoded signals would require.

To enable the debugger to collect a trace, all you need to do is connect both MDM and XMTC cores to the MicroBlaze processor’s debugging and trace interfaces, respectively, and bring out the encoded trace signals to FPGA pins so that F-Sight can collect the data. After FPGA implementation, connect the F-Sight debugger to your board’s Mictor connector. If you are using a Xilinx-made ML400 series, ML500 series, or Spartan™-3E/3A/3AN FPGA board that does not include a Mictor connector, you can still employ the processor trace feature in F-Sight by using the corresponding Computex F-Sight adapter. Figure 1 shows F-Sight connected to a Spartan-3 board using the F-Sight adapter.

Figure 1 – F-Sight connected to a Spartan-3 board

Using Processor Trace

A processor trace monitors program execution without stopping the processor. This allows you to analyze program flow over long periods to identify problems in the code, all without changing the processor execution state. Computex F-Sight provides processor trace capabilities that can prove very useful in different situations.

Imagine a program that is constantly generating exceptions. Exceptions may happen anywhere in the program; the challenge is to find out why and where the exceptions are being generated. For this, you can set a breakpoint either at the start of the exception being thrown or at the exception vector so that the program stops when it reaches the breakpoint. When the program stops, you can look at the execution history collected by F-Sight. It will show which instructions were executed before going into the exception handler.

Stack overflow is also a common problem in embedded systems. All of a sudden a program starts executing in a location that just does not look right. The stack is probably getting corrupted because of the overflow. If you suspect this problem is occurring, you can set a trigger to start or stop trace data collection. By setting the trigger condition to allow a comparison between the stack pointer and the upper stack limit, the program will break right when the condition occurs. You can then easily identify the stack overflow and the location where it happened.

In certain cases with real-time systems, stopping a processor for debugging may not be an option, as stopping might change the program behavior. Other times the problem might occur very rarely, which might require you to monitor the program execution over a long time. Using F-Sight, you can set complex trigger conditions and collect trace data, which can be post-analyzed to debug the problem.

Probing Internal Signals

When debugging FPGAs, it is common to begin by simulating the design. The simulator is not capable of finding faults with the specifications, although it is able to find errors in the design. Also, it often happens that designs that pass all tests when simulated do not eventually work once implemented on the FPGA. When this happens, you are forced to debug the problem on the actual target system by using a logic analyzer.

The problem arises when you try to bring the signals out of the FPGA so that the logic analyzer can monitor their waveform patterns. As in the majority of cases when designing large-scale embedded systems, it will take a considerable amount of time to go through the FPGA synthesis and implementation tools even for minor changes (such as the changes needed to route signals out to pins). Plus, you risk running into timing problems caused by the different placement and routing. The actual time to run the implementation tools depends on the scale of circuitry and the host computer performance, but it’s possible that you will only be able to debug a few times in a day.

Fortunately, Computex F-Sight has a very useful feature that allows you to modify the design to bring internal FPGA signals out to pins without going through the synthesis and implementation tools. The feature is called “probing.” Simply select the internal FPGA signals in the display showing HDL source (Figure 2). F-Sight will automatically take care of the rest, adding the appropriate routing to the test pins on your behalf. It does this by utilizing the FPGA Editor included in Xilinx ISE™ software tools. With this feature, the time required for logic synthesis and place and route (time that you could have used for debugging) is minimized, allowing you to spend more time monitoring waveform patterns.

Cooperative Debugging

When the system is not functioning properly, the only thing you can do is investigate the cause of the problem based on actually occurring events. In some cases, event tracking is easier if you do it from the hardware; in other cases, it is easier if you do it from the software. For example, in the case of hardware, if the signal indicating abnormality is asserted you can set the signal as a trigger. In the case of software, if the exception handler is called, you can set a breakpoint at the exception handling routine and run the user program. The process of event occurrence will be captured into the tracing buffer in F-Sight.

However, the issue here is that even if the history was captured, identifying causes will take time unless the correlation between the hardware and software is known. For this reason, Computex has implemented a cooperative debugging feature that allows you to synchronize the hardware (analyzer) and software (trace) histories in F-Sight. With this feature, you can check the wave pattern and program behavior at the time the event occurred on the same time axis. The program execution history and the source code view will follow as you scroll through the wave pattern display in the analyzer window (Figure 3). The cooperative debugging feature is powerful in that it quickly identifies causes by allowing cooperative debugging from both the hardware and software sides.
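Placing both histories on one time axis amounts to a timestamp lookup: given a point on the waveform, find the software trace record that was current at that moment. The Python sketch below illustrates the idea only. The record format, the sample data, and the function name are hypothetical, and F-Sight's internal implementation is not described in this article.

```python
import bisect

# Hypothetical software trace: (timestamp in ns, function being executed), sorted by time.
TRACE = [(0, "main"), (120, "read_sensor"), (250, "exception_handler")]

def trace_record_at(trace, t_ns):
    """Return the latest trace record whose timestamp is <= t_ns, or None if there is none."""
    timestamps = [ts for ts, _ in trace]
    i = bisect.bisect_right(timestamps, t_ns) - 1
    return trace[i] if i >= 0 else None

# A hardware signal was seen asserting at 200 ns on the analyzer side (assumed value).
record = trace_record_at(TRACE, 200)
```

With such a lookup, scrolling the waveform cursor to any time can update the program history view to the matching record, which is the behavior the article describes for the analyzer window.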

Debugging Flash Memory

The FPGA’s internal memory is commonly used to store the embedded processor program. However, this memory’s capacity is often insufficient for large programs. One alternative is to use external flash memory and store the user program in it.

Although some debuggers do not have the capability to write to flash memory, F-Sight allows you to fully debug programs that are located in flash memory just as you debug programs located in internal memory. For example, you can download the user program, apply patches to a part of the memory, or even set software breakpoints in the flash memory.

In F-Sight, you can select from among more than 1,000 available types of flash memory options. Even if the flash memory you are using is not in the list, you can easily add the entry manually using the graphical user interface.

Conclusion

Debugging systems with FPGA embedded processors should not be a time-consuming task. When you encounter problems, you need the right tools to deal with them efficiently, leaving you more time to focus on development. F-Sight certainly proves to be a tool that will improve your debugging efficiency.

70 Xcell Journal Second Quarter 2008

Figure 2 – F-Sight probing

Figure 3 – F-Sight cooperative debugging


by Jorge Carrillo
Staff Software Engineer
Xilinx, [email protected]

Raj Nagarajan
Senior Software Engineer
Xilinx, [email protected]

Oliver Oppitz
System Design [email protected]

The best thing about FPGAs is their flexibility, which inspires designers to create a myriad of different designs. However, support for debugging the design is often considered last – if at all – so the debugger often has to adapt to the system.

The good news is that a debugger exists from a company that has been around for nearly 30 years in the embedded arena, gaining experience with all of the problems you can imagine and some that you do not even want to hear about. In this article, we’ll show you a few examples of big and small features that Lauterbach’s TRACE32 debugger offers – features that might save your day or even your project.

Flexibility in Debugging for a Flexible Platform

Speaking of flexible designs, a really interesting customer system comes to mind. It used two Xilinx® MicroBlaze™ processor cores with an internal block RAM on a Virtex™-5 LX50T device. The peculiar thing about this design was that each MicroBlaze processor had only an I-side interface to the block RAM for instruction fetch. The second memory port of the block RAM was connected to a PCI Express interface, used to remotely change the application code and boot the processor at run time. The challenge was how to debug in a memory area that the debugger could neither read nor write, because the MicroBlaze processor itself could not perform load/store operations in this area.

We solved the problem using TRACE32’s internal “virtual memory,” a simulated memory inside the debugger that has an indefinite address space (64 bits) and for which memory is allocated on demand. We loaded the target program into this virtual memory and configured the debugger’s internal address translation mechanism to remap accesses from unreadable target memory to virtual memory. This made it possible to diagnose the program even at the assembly level.
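To make the remapping idea concrete, here is a minimal sketch of a debugger-side read that serves one address window from a shadow copy instead of the target. All names here (`remap_window`, `debugger_read`, `unreadable_target`) are hypothetical; this models the concept only and is not TRACE32 code or its command syntax.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a debugger-side remap; not TRACE32 code. */
typedef struct {
    uint32_t base;    /* first target address that is redirected */
    uint32_t size;    /* length of the redirected window in bytes */
    uint8_t *shadow;  /* debugger-held copy of the program image */
} remap_window;

/* Signature of a real target access routine. */
typedef int (*target_read_fn)(uint32_t addr, uint8_t *dst, size_t len);

/* Stand-in for a target whose memory cannot be read at all. */
int unreadable_target(uint32_t addr, uint8_t *dst, size_t len)
{
    (void)addr; (void)dst; (void)len;
    return -1;
}

/*
 * Read len bytes at addr. Accesses that fall entirely inside the
 * window are served from the shadow copy; everything else goes to
 * the target. Returns 0 on success, nonzero on failure.
 */
int debugger_read(const remap_window *w, target_read_fn target_read,
                  uint32_t addr, uint8_t *dst, size_t len)
{
    assert(w != NULL && dst != NULL);
    if (addr >= w->base && len <= w->size &&
        addr - w->base <= w->size - len) {
        memcpy(dst, w->shadow + (addr - w->base), len);
        return 0;
    }
    return target_read(addr, dst, len);
}
```

Because the disassembler and source views only ever call the read routine, they work unchanged once the window points at the loaded program image.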

But there was another challenge: for stepping through the program, especially over high-level language lines and conditional branches, one normally uses software breakpoints. With no access to the block RAM from the MicroBlaze processor, this was out of the question. The solution came in the form of a “map.break” command, used to force the debugger to exclusively use hardware breakpoints for the given address range. Other flavors of the map command allow you to specify the data width of memories, change the endianness, or completely prevent the debugger from accessing an address range with critical peripheral registers.
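The effect of restricting an address range to hardware breakpoints can be pictured as a lookup the debugger performs before it places a breakpoint. The sketch below is an assumption-laden illustration: `hw_only_range` and `choose_breakpoint` are invented names, and the real command's behavior may involve more than this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { BP_SOFTWARE, BP_HARDWARE } bp_kind;

/* Hypothetical range table entry; the names are illustrative only. */
typedef struct {
    uint32_t start;  /* first address of the range */
    uint32_t end;    /* last address of the range, inclusive */
} hw_only_range;

/*
 * Pick the breakpoint kind for addr: hardware if addr lies in any
 * range marked hardware-only, software otherwise.
 */
bp_kind choose_breakpoint(const hw_only_range *ranges, size_t n,
                          uint32_t addr)
{
    assert(ranges != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (addr >= ranges[i].start && addr <= ranges[i].end)
            return BP_HARDWARE;
    }
    return BP_SOFTWARE;
}
```

A software breakpoint needs a write to program memory, so any range the processor cannot store to has to end up in this table.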

A Complete Debugging Solution for Xilinx Embedded Processors

Employ the industry-proven TRACE32 debugger to cope with the challenges of debugging advanced embedded processor systems.

Small but Useful Features

Lauterbach’s experience with JTAG debuggers and emulators shows both in the features of the TRACE32 debugger and in many small and unexpected details. Given that Lauterbach develops all software with its own tools, this is not surprising. And daily problems often inspire the most useful features.

Have you ever wanted to disable the annoying timer interrupt handler during a debugging session or invert a conditional branch without restarting? Or to patch a loop to see if the core executes correctly from external memory? The built-in assembler lets you do just that.

Another typical scenario during debugging: How often have you stepped just one line too far and had to start all over? The register-restore feature will undo the last steps.

TRACE32 displays memory by continuously rereading it 10 times per second, even while the processor is stopped. Why does it do this? Perhaps a second MicroBlaze processor in your system, which you kept running, is causing some data corruption and you would like to monitor this as well. Perhaps your sometimes unstable memory now causes telltale flickering on the screen. Or perhaps your JTAG interface is really not so stable at 20 MHz. These are things you would certainly like to know. On the other hand, TRACE32 guarantees that it will only access memories when required.

Having touched on the topic of peripherals, we should also mention the peripheral register files. These specify the location, width, and even bit-wise encoding of memory-mapped registers, grouping them into a tree of related registers. In this way the peripheral registers are easily accessible for inspection and modification: you can disable your DMA controller with a click instead of digging in the target manuals for the correct bit. The debugger comes with specifications for standard peripherals, but you can modify the files for your purposes using a simple text editor. For the Xilinx tool chain, an add-on is in the works to generate these files on the fly as part of the Xilinx build process.
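What such a description buys you can be sketched with a small field descriptor that records where a bit field lives, so that reading or changing it no longer requires looking up the bit position by hand. The `reg_field` structure and the field names below are made up for illustration; this is not the TRACE32 peripheral file format.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field description, loosely modeled on the idea of a
 * peripheral register file; not the actual TRACE32 file format. */
typedef struct {
    const char *name;  /* e.g. "DMA_ENABLE" (made-up name) */
    uint32_t offset;   /* word index of the register */
    unsigned lsb;      /* position of the field's lowest bit */
    unsigned width;    /* field width in bits, 1..32 */
} reg_field;

/* Bit mask covering the field inside its 32-bit register. */
static uint32_t field_mask(const reg_field *f)
{
    assert(f->width >= 1 && f->width <= 32 && f->lsb + f->width <= 32);
    uint32_t ones = (f->width == 32) ? 0xFFFFFFFFu
                                     : ((1u << f->width) - 1u);
    return ones << f->lsb;
}

/* Read the current value of the field. */
uint32_t field_get(const volatile uint32_t *regs, const reg_field *f)
{
    return (regs[f->offset] & field_mask(f)) >> f->lsb;
}

/* Write value into the field, leaving all other bits untouched. */
void field_set(volatile uint32_t *regs, const reg_field *f, uint32_t value)
{
    uint32_t mask = field_mask(f);
    regs[f->offset] = (regs[f->offset] & ~mask) | ((value << f->lsb) & mask);
}
```

Clearing a one-bit enable field then comes down to a single `field_set` call with a value of 0, which is the programmatic equivalent of the click described above.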

Flexible IDE

TRACE32 offers a powerful graphical user interface (GUI), but its command-line origins still show in two highly useful features: the debugger command line at the bottom of the screen and the fact that practically every function of the GUI is also accessible through commands – and thus from scripts. This automates all of your routine tasks, including target configuration, laying out windows, and arranging them on multiple virtual screens. Best of all, the windows do not have a docking behavior like many IDEs but can be freely positioned and resized, even overlapping each other. There are also couplings with various IDEs, so you can invoke TRACE32 directly from your Eclipse environment, for example.

Attaching to Multi-Core Targets

Another interesting feature is Lauterbach’s intuitive approach to debugging multiple cores in the target. Just start an instance of the GUI for each core and have them share a debugging cable; this also works for heterogeneous systems with PowerPC and MicroBlaze cores or for another system from the 50-plus processor architectures supported by TRACE32 (Figure 1).

TRACE32 attaches to the same JTAG connector used by the Xilinx platform cable and will work with any design created with the Xilinx Embedded Development Kit (EDK). For PowerPCs, dedicated debugging connectors are also supported.

In the context of multi-core systems, the synchronized starting and stopping of cores becomes an issue. For targets supporting this in hardware, like in multi-MicroBlaze processor configurations, the debugger uses hardware features to achieve cycle-accurate synchronization; otherwise the synchronization is done in software. The integrated scripting language is multicore-aware, allowing control of all GUIs from a master script for connecting debuggers to their respective cores, resetting them, and downloading and starting application programs.

Figure 1 – Lauterbach TRACE32 debugging and trace cable connected to Xilinx ML507 board

Real-Time Program Flow and Data Trace

The main feature of the real-time trace is recording the program flow, defined as each and every instruction the processor executed as well as data transactions. For MicroBlaze processors, this is done through the Xilinx MicroBlaze trace core (XMTC) that comes with Xilinx Platform Studio. XMTC includes a trace encoder, which contains an input interface that attaches to a MicroBlaze processor trace port comprising approximately 200 unencoded signals. It also includes an output interface, which provides an encoded trace in 21 signals.

The trace hardware provides up to 512 MB of external high-speed trace memory, which is used instead of the scarce on-chip memory resources for storing trace information. Trace functionality is also supported on PowerPC architectures. More advanced features include statistical analysis of function and task run times, variable access, and code coverage analysis (Figure 2, Figure 3A and 3B).

Operating System Support

On the MicroBlaze processor, TRACE32 comes with a so-called kernel awareness module for µClinux and Linux. For PowerPCs, many more operating systems are supported, including QNX, VxWorks, and Nucleus PLUS. The extensions make the debugger aware of the kernel-related data structures in the target. They allow debugging on the process level using process-specific breakpoints and program control. Other features include full MMU support; real-time, non-intrusive display of Linux system resources like loaded kernel modules or mounted file systems; statistical evaluation and graphical display of task run times; and task-related evaluation of function run times.

Conclusion

The Lauterbach TRACE32 provides a comprehensive debugging solution for PowerPC and MicroBlaze processors on all Xilinx device families. Planned Xilinx-related additions include enhancing the debugging cable so that the Xilinx ChipScope™ analyzer can use it for target accesses jointly with the debugger, and downloading FPGA bitstreams through the debugger itself.

For more information, please visit www.lauterbach.com.


Figure 2 – Lauterbach TRACE32 IDE showing trace, code coverage, and function call chart views

Figure 3A – Lauterbach Mictor MicroBlaze trace adapter for Xilinx Spartan-3E FPGA board

Figure 3B – Lauterbach Mictor MicroBlaze trace adapter for Xilinx MLx boards

Xilinx Global Trade Show Calendar

Xilinx participates in numerous trade shows and events throughout the year. This is a perfect opportunity to meet our silicon and software experts, ask questions, see demonstrations of new products, and hear other customer success stories.

For more information and the most current schedule, visit www.xilinx.com/events/.

April 14-17, 2008: NAB 2008, Las Vegas, NV

April 15-17, 2008: Embedded Systems Conference Silicon Valley, San Jose, CA

April 15-17, 2008: Special Event Effects Symposium, Long Beach, CA

April 21-22, 2008: China International Medical Electronics Technology, Shenzhen, China

May 14, 2008: Linley Tech Seminar: High-Speed Interconnects, San Jose, CA

May 14-16, 2008: Embedded Systems Expo 2008, Tokyo, Japan


by Fernando Martinez Field Application [email protected]

The Xilinx® Spartan™ and Virtex™ product families offer designers a wide range of choices for embedded design. This array of products allows designers to create custom accelerators for embedded applications as well as entire systems on a single FPGA. The most vexing problem facing today’s FPGA designers is shifting away from part capacity and toward design turnaround time.

Depending on the complexity of the target application, going from a written specification or reference C model to a working FPGA design can take anywhere from weeks to months. This increases the pressure on designers and makes the goal of first to market harder to reach.

Advanced Algorithmic Synthesis

The only way to accelerate design time on an FPGA is to automate the creation of the RTL. Although several attempts have been made to go from C to RTL, these solutions have fallen short of customer expectations because of language transformations, application complexity, and size limitations.

These concerns, which have limited the widespread adoption of high-level synthesis in the FPGA market, are addressed by Synfora’s PICO platform, a complete development environment capable of taking sequential algorithmic ANSI-C and creating high-quality parallel hardware on par with RTL created by hand.

Reduce FPGA Design Time with PICO Express FPGA

Synfora’s PICO Express accelerates algorithm-to-design completion on an FPGA.

PICO Express FPGA, which is a key component in the PICO platform, provides a complete development environment capable of taking sequential algorithmic ANSI-C and creating high-quality parallel hardware that is on par with RTL created by hand.

The design methodology in PICO Express FPGA allows design teams to move to a higher level of abstraction without sacrificing design performance. Another aspect of the PICO design flow is the ability to retarget RTL generation across Xilinx product families in terms of part, speed grade, package, and clock frequency without having to modify the C code.

PICO is also fully integrated with FPGA back-end flows from both Xilinx and Synplicity. You can either use your own synthesis scripts or scripts provided by PICO to generate an FPGA programming bitstream.

Besides integration into standard FPGA back-end flows and the ability to target Spartan and Virtex parts, PICO gives you the ability to carry out what-if analysis. Under design space exploration, PICO allows side-by-side comparisons of multiple designs targeted at different throughputs, clock frequencies, part families, package options, and speed grades.

In this article, I’ll show how PICO Express can quickly take an ANSI-C functional model and create a fully functional FPGA design. The application chosen for this demonstration combines real-time color space conversion from RGB to YUV and a Sobel edge detector. Although the C model for this application is part independent, Synfora has implemented a working system on the Spartan-3E video display kit.

Application Development on PICO Express

Application development on PICO Express is divided into three stages. The first two stages are the driver and synthesizable C model creation. As the example will show, the synthesizable C model is completely sequential without any extra need for a user declaration of parallelism. The advanced compiler built into PICO automatically analyzes user applications to extract data dependencies and derive parallelism from sequential code.

The driver code comprises a test bench used to verify the correct operation of the function targeted for hardware synthesis. In the case of this example, the driver code reads input image files, passes the data to the hardware procedure, and gathers the results. The results of the hardware function are compared against the expected output created during test vector selection.

During the several steps needed to go from C to RTL, PICO simulates each transformation to ensure correct hardware by design. The simulations during hardware design are also useful in estimating design performance and resource utilization before RTL simulation and back-end place and route.

The second stage is the creation of the synthesizable C model. The synthesizable C model is the software procedure that will be implemented as a hardware block. This procedure code can range from functionality as simple as a single accelerator block to functionality as complex as a complete H.264 encoder/decoder. You only need to let PICO know the name of the top-level procedure in the synthesizable model. PICO Express automatically transforms all sub-procedures into hardware.

Color Conversion and Edge Detection

The color conversion and Sobel edge detection application is divided into two main components. Color conversion takes the RGB from the video source and creates an on-the-fly YUV representation of this video. The conversion to YUV has the benefit of acting as a filtering operation, which reduces noise in the original RGB video stream.

Figure 1 shows summarized pseudocode for the color converter; note that the type of C code accepted by PICO Express is no different from the C code accepted by any ANSI-C compatible software compiler. This allows for code development using open-source tools such as gcc.

After color conversion, the new video stream is passed through a Sobel edge detector. Sobel edge detection is a classic filter used in image processing to reduce the information in both still images and video to a series of edges. It is based on a convolution that uses a 3 x 3 window to calculate the strength of an edge at the center of the window. Depending on the chosen threshold, pixels will have a value of either 1 or 0 after the windowing function. Figure 2 shows the code used for the edge detector.

Like the code in Figure 1, the code in Figure 2 is completely compliant with ANSI-C and can be compiled on any C compiler. Figure 2 shows the use of arrays for storing both horizontal and vertical windows.

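unsigned char rgb_to_yuv(unsigned int rgb) {
    ...
    /* weighted sum of the color components plus offset (4096 = 16 << 8) */
    resulty = ((66 * Rf) >> 8) + ((129 * Gf) >> 8) + ((25 * Bf) >> 8) + 4096;
    /* round and drop the low 8 bits */
    resulty = (resulty + 128) >> 8;
    /* saturate to the 8-bit range */
    if (resulty > 255) resulty = 255;
    ...
}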

Figure 1 – RGB to YUV converter
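For readers who want something they can compile, here is a minimal luma-only sketch built from the coefficients in Figure 1. It makes assumptions the figure does not state: the pixel is taken to be packed as 0x00RRGGBB, the coefficients are applied directly to the 8-bit components, and the name `rgb_to_y` is borrowed from the call used in Figure 2. It is not Synfora's code.

```c
#include <assert.h>

/*
 * Luma (Y) of one packed pixel. Assumes a 0x00RRGGBB layout, which
 * the figure does not specify. The coefficients 66, 129, 25 and the
 * offset of 16 follow Figure 1; the scaling by 256 is folded into a
 * single shift here.
 */
unsigned char rgb_to_y(unsigned int rgb)
{
    unsigned int r = (rgb >> 16) & 0xFFu;
    unsigned int g = (rgb >> 8) & 0xFFu;
    unsigned int b = rgb & 0xFFu;
    unsigned int y = ((66u * r + 129u * g + 25u * b + 128u) >> 8) + 16u;

    if (y > 255u)   /* saturate, as in Figure 1 */
        y = 255u;
    return (unsigned char)y;
}

/* Driver-style check, in the spirit of the test bench described earlier. */
void rgb_to_y_selfcheck(void)
{
    assert(rgb_to_y(0x000000u) == 16);   /* black */
    assert(rgb_to_y(0xFFFFFFu) == 235);  /* white */
}
```

With these coefficients the output stays within 16 to 235, so the saturation test never triggers for 8-bit inputs; it is kept only to mirror the figure.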


PICO will analyze arrays in your code to determine the optimal mapping to minimize resource utilization and maximize performance. Depending on the size of an array and how it is used in the design application, PICO will determine the optimal storage medium for the array as LUTs, block RAMs, or external memories. It is the advanced memory analysis capabilities of PICO that allow for a simple coding model for memories. Arrays are a common storage structure with an easily understood usage model in the C domain. Therefore, designers can specify both simple and complex memory architectures with ease.
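As an illustration of how plain arrays can describe a memory architecture, the sketch below keeps two line buffers and a 3 x 3 window for a streaming filter. The structure, the names, and the `MAX_WIDTH` value are assumptions made for this example; it does not show how PICO would actually map each array to LUTs or block RAMs.

```c
#include <assert.h>

#define MAX_WIDTH 1280  /* assumed maximum line length in pixels */

/* Hypothetical sliding-window state: two line buffers plus a 3x3
 * window, all written as plain arrays. */
typedef struct {
    unsigned int line[2][MAX_WIDTH]; /* previous two image rows */
    unsigned int window[3][3];       /* current 3x3 neighborhood */
} window_state;

/*
 * Shift one new pixel from column x of the current row into the
 * window. After the call, window[r][2] holds column x of the row
 * two above, the row above, and the current row (r = 0, 1, 2).
 */
void window_push(window_state *s, int x, unsigned int pixel)
{
    assert(s != 0 && x >= 0 && x < MAX_WIDTH);

    /* Shift the window one column to the left. */
    for (int r = 0; r < 3; r++) {
        s->window[r][0] = s->window[r][1];
        s->window[r][1] = s->window[r][2];
    }
    /* Fill the right column from the line buffers and the input. */
    s->window[0][2] = s->line[0][x];
    s->window[1][2] = s->line[1][x];
    s->window[2][2] = pixel;

    /* Age the line buffers for the next row. */
    s->line[0][x] = s->line[1][x];
    s->line[1][x] = pixel;
}
```

The large `line` arrays are the kind of storage one would expect to land in block RAM, while the nine-entry `window` is small enough for registers or LUTs; that split is an expectation, not a documented result.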

Figure 3 summarizes the data and control flow in this example. For the board implementation, the bitstream is mapped onto a Spartan-3E device. The base C code used to create the hardware is independent of the source video resolution. As in most video processing systems, the PICO-designed hardware will use the first image frame to calculate the image resolution from the video source.

In this case, the video is of HD quality at 720p resolution. On the FPGA, the system runs at 133 MHz, which is sufficient to comply with the DVI standard for 720p HD video. In terms of synthesis time, this application takes less than five minutes in PICO to generate RTL and less than 10 minutes to generate a bitstream from Xilinx Synthesis Technology (XST).
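A quick calculation shows why 133 MHz leaves headroom. The sketch assumes 720p at 60 frames per second, which the text does not state, with the standard total raster of 1650 x 750 pixels including blanking, and a design that handles one pixel per clock cycle. The function names are illustrative.

```c
#include <assert.h>

/*
 * Pixel rate in Hz for a raster of h_total x v_total pixels
 * (active plus blanking) refreshed frames_per_s times a second.
 */
unsigned long pixel_rate_hz(unsigned long h_total, unsigned long v_total,
                            unsigned long frames_per_s)
{
    return h_total * v_total * frames_per_s;
}

/* Nonzero if a design clocked at clock_hz, processing one pixel per
 * cycle (an assumption), keeps up with the given pixel rate. */
int clock_is_sufficient(unsigned long clock_hz, unsigned long pixel_hz)
{
    assert(pixel_hz > 0);
    return clock_hz >= pixel_hz;
}
```

Under these assumptions, `pixel_rate_hz(1650, 750, 60)` gives 74,250,000 pixels per second, so a 133 MHz clock has well over one cycle available per pixel.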

Conclusion

PICO Express for FPGA provides the necessary tools and methodology for you to quickly go from algorithmic code to fully functional designs on an FPGA. The ANSI-C based input accepted by PICO presents you with a familiar coding language and helps reduce the learning curve associated with high-level synthesis tools. Also, PICO provides the capability to quickly compare price/performance trade-offs that allow you to choose the right part without having to modify your application code.

For additional information on how PICO Express can accelerate the development of FPGA-based embedded systems, visit www.synfora.com.

Figure 3 block labels: Original Video Frame; RGB to YUV Fixed-Point Conversion; Sobel Edge Detector; YUV to RGB Conversion; Output Video.

unsigned char sobel(unsigned int window[3][3]) {
    ...
    y00 = rgb_to_y(window[0][0]);
    y01 = rgb_to_y(window[0][1]);
    y02 = rgb_to_y(window[0][2]);
    y10 = rgb_to_y(window[1][0]);
    y12 = rgb_to_y(window[1][2]);
    y20 = rgb_to_y(window[2][0]);
    y21 = rgb_to_y(window[2][1]);
    y22 = rgb_to_y(window[2][2]);

    weight_horizontal = weight_calc(y00, y01, y02, y20, y21, y22);
    weight_vertical = weight_calc(y00, y10, y20, y02, y12, y22);

    total_weight = weight_horizontal + weight_vertical;
    return total_weight >= THRESHOLD;
}

Figure 2 – Sobel edge detector
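Figure 2 calls a `weight_calc` helper that is not listed. One plausible form, assuming the textbook Sobel kernels with weights 1, 2, 1 on each side of the window, is sketched below. The helper body, the `THRESHOLD` value, and the `sobel_y` wrapper (which takes luma values directly) are all assumptions for illustration rather than the article's code.

```c
#include <assert.h>
#include <stdlib.h>

#define THRESHOLD 128  /* assumed value; the article does not give one */

/*
 * Hypothetical weight_calc: absolute response of one Sobel kernel.
 * a, b, c are the three pixels on one side of the window and
 * d, e, f the three on the opposite side; the middle pixel of each
 * side is weighted by 2.
 */
int weight_calc(int a, int b, int c, int d, int e, int f)
{
    return abs((a + 2 * b + c) - (d + 2 * e + f));
}

/* Edge decision for a 3x3 window of luma values: 1 = edge, 0 = none. */
unsigned char sobel_y(unsigned char y[3][3])
{
    assert(y != NULL);
    /* top row against bottom row */
    int weight_horizontal = weight_calc(y[0][0], y[0][1], y[0][2],
                                        y[2][0], y[2][1], y[2][2]);
    /* left column against right column */
    int weight_vertical = weight_calc(y[0][0], y[1][0], y[2][0],
                                      y[0][2], y[1][2], y[2][2]);
    return (unsigned char)(weight_horizontal + weight_vertical >= THRESHOLD);
}
```

The argument order matches the two calls in Figure 2, and the center pixel is never read, which is consistent with the figure skipping `y11`.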

Figure 3 – Color conversion and Sobel edge detector

Topics Relevant to You Delivered in One Monthly Newsletter!

Thanks for your feedback! We have now created a new Xilinx newsletter that will provide you all of the information you need in one location. This newsletter will replace the multiple topic newsletters you have currently been receiving.

It's simple! Choose only the topics you want to receive and your customized newsletter will be delivered to your inbox every month.

http://newsletter.xilinx.com


by Angela Sutton Sr. Product Marketing Manager Synplicity, Inc. [email protected]

The Xilinx® Embedded Development Kit (EDK) is available to designers wishing to implement embedded designs using IBM PowerPC hard processor cores, Xilinx MicroBlaze™ embedded processor cores, and Xilinx-supplied implementations of buses, block RAMs, and peripherals such as USB and PCIe.

If you generated an embedded processor subsystem with EDK and wish to synthesize and debug your complete system for implementing on a Xilinx FPGA, you can do so using the Synplify Pro and/or Synplify Premier (referred to in this article as Synplify Pro/Premier) integrated embedded design flows from Synplicity, Inc. These Synplify Pro/Premier/EDK flows allow you to perform optimization and debug operations on an entire embedded FPGA system, including Xilinx embedded cores.

An integrated Synplify Pro/Premier/EDK flow allows designers to synthesize an entire design top-down as one design entity, handled as a single Synplify Pro/Premier design project.

Design Embedded Systems with Xilinx and Synplicity

You can use the integrated Xilinx EDK and Synplicity’s Synplify Pro or Synplify Premier software development flow to synthesize and debug.

Flow: ISE software/EDK hardware development flow (.ise project file)

When to Use: Most processor-based embedded subsystems and designs with non-peripheral custom logic

Details:

• Allows the addition of non-peripheral custom logic to the system

• Allows the addition of HDL files that contain non-peripheral custom logic directly to the project

• You can integrate processor subsystems as either top-level modules or sub-modules

Flow: Standalone EDK hardware development flow (.xmp project file)

When to Use: PowerPC or MicroBlaze processor-based embedded system or sub-modules

Details:

• The flow includes system top-level hardware synthesis, mapping, and place and route processes in the background so that it is not necessary to work with Xilinx ISE software to generate the configuration bitstream file.

• Compiles and executes multiple software applications, generates libraries for the code, and merges software with hardware files for downloading to a target board.

• Each hardware component is treated as an independent core (a separate XST project file is generated for each core).

• During synthesis, XST automatically runs on each core separately to generate netlists (NGC files) through the Xilinx Xflow.

• If the core is encrypted (MicroBlaze processor cores, for example), XST is always invoked to provide an NGC netlist.


Table 1 – Choosing which Synplify Pro/Premier/EDK flow to use


In addition to automating the flow, this integrated Synplify Pro/Premier/EDK flow offers the following advantages:

• Better quality of results (QoR): Synplify Pro/Premier considers the design in its entirety when performing timing-driven and area optimizations. Moreover, the tools can time through the core and may potentially optimize across EDK core boundaries, which improves design performance QoR.

• Visibility and debugging capabilities during implementation: You only need to run synthesis once to generate the final design netlist. Debugging and optional floorplanning can be performed on the entire design. You can also use Synplify Pro/Premier’s HDL analyst, timing analyst, physical analyst, island timing analyst, and add-on Identify RTL debugging tool to debug and analyze the entire design.

Let’s describe this flow in more detail.

Overview of Xilinx EDK Flow

The Synplify Pro or Synplify Premier tool from Synplicity is required to perform synthesis, as a part of the overall embedded FPGA hardware creation process, after the EDK cores have been included in the design. Two Xilinx EDK embedded hardware development flows are available today for creating your embedded design and generating the embedded hardware in HDL format.

The two flows are:

• Xilinx stand-alone EDK hardware development flow for those using Xilinx Platform Studio (XPS)

• Xilinx ISE™ software/EDK hardware development flow for those using Xilinx ISE software

Table 1 includes additional details about which of these two flows to use. Figure 1 illustrates the overall design flow steps for both flows.

Integrated Synplify Pro/Premier/EDK Flow

The Synplify Pro/Premier/EDK flow comprises four steps, as shown in Figure 2.

Here are some details about each step.

1. Generate embedded cores using XPS or ISE software. Create an EDK project that Synplify Pro/Premier can read to generate a core netlist using one of the two flows:

• The Xilinx stand-alone EDK hardware development flow (generates an .xmp file) for XPS users

• The Xilinx ISE software/EDK hardware development flow (generates an .ise file) for ISE software users

In the XPS GUI:

• Select Project > Project Options > Hierarchy and Flow

• Select Implement Design in ISE software (Export to Project Navigator Flow: Depreciated)

• Select Hardware > Generate Netlist to generate the netlist

The .ise or .xmp file generated will refer to a variety of design files that represent the core. These files are shown in Table 2.

2. Read the generated files into a Synplify Pro/Premier design project. The files that Synplify Pro and Synplify Premier automatically read are listed in Table 3. The Synplify Pro/Premier tools read the NGC netlist and EDK-generated platform (as specified by the .ise/.xmp project files); they are automatically included in the design project. The Synplify Pro/Premier software reads many different files that are part of the Xilinx project file, including RTL, netlists, and constraints (V/VHD, EDN, NGC, NGO, SDC, UCF, OPT). It adds these definition and constraint files to the synthesis project it creates before running synthesis.

Figure 1 – Synplify Pro/Premier/EDK flow. Embedded cores are read into the Synplify Pro/Premier tool, where the design is fully synthesized. The post-place and route-generated bitmap is returned to Xilinx Platform Studio. (Stages shown in the hardware development flow: Processor, Peripherals, and Bus Specification; Automatic Hardware Platform Generation; Import Project into Synplicity; Add/Edit HDL Code; Run Synthesis; Run Place and Route; Import Bitmap Back to XPS; Hardware Configuration; Bitstream.)

Figure 2 – The four steps involved in the EDK Synplify Pro/Premier flow. (Steps shown: 1. Generate embedded cores; 2. Read into Synplify design project; 3. Synthesize, place and route; 4. Read bitmap file into Xilinx Platform Studio. Tools and files shown: Xilinx Platform Studio; Synplify Pro, Synplify Premier, ISE software; .ise or .xmp Xilinx project; .prj Synplify project; <system>.bit and <system>_bd.bmm bit file.)


Note that any custom HDL logic will be included in the .ise project file and read automatically by the Synplify Pro/Premier tool if you used the Xilinx EDK/ISE hardware development flow. However, you must add these files manually to the Synplify Pro/Premier project if you are using the Xilinx stand-alone EDK hardware development flow.

3. Invoke and run Synplify Pro/Premier, then place and route.

• In the Synplify Pro/Premier tool, open the Synplify Pro/Premier design (.prj) project, which now includes the .ise/.xmp embedded cores. The cores will now be part of the Synplify Pro/Premier design.

• If the EDK cores are a subsystem in a larger design, simply instantiate the subsystem into the top-level HDL design.

• Run synthesis and then complete place and route using the normal Synplify Pro/Premier ISE software flow. You can run Xilinx place and route from the Synplify Pro/Premier interface to generate a top-level hardware bitmap file.

Table 4 shows the output files created after synthesis, including a bitmap file.

4. Read bitmap back into XPS.

• Import the bitmap file back into the XPS environment. The <system>.bit and <system>_bd.bmm files originally found in the synplify/rev_1/par_1 directory should be placed in the “implementation” directory within the EDK project directory. In the XPS GUI, select Project > Import from ProjNav.

• Select the BIT and BMM files from the PAR directory

• BIT file synplify/rev_1/par_1/<system>.bit

• BMM file synplify/rev_1/par_1/edkBmmFile_bd.bmm

• Click OK

• Run Device Configuration > Update Bitstream. You can now download the bitstream into the FPGA.

Encrypted CoresIn some cases, the cores are encrypted byXilinx (for example, Xilinx MicroBlazeprocessors). You can merge these coresinto your design with SynplifyPro/Premier. The core may appear as ablack box in the tool.

The PAO file generated by XPS willpoint to an encrypted IP core definitionfile. Synplify Pro/Premier software allowsyou to specify the appearance of the coreas a black box (in which case the corewrapper file will be present and copiedinto the Synplify Pro/Premier folder). Ifyou specify that the core should not betreated as a black box – in other words, apredefined netlist exists for the core – thetool will seek to locate a definition filenamed implementation/<core>.ngc. If itfinds the file, it adds it to the SynplifyProject .prj file. If it does not find theimplementation/<core>.ngc file, it issuesan error message.

In Synplify Pro/Premier software version 9.0.2 and above, a new secure .ngc flow exists that allows you to easily include encrypted cores in the synthesis flow. It also allows the Synplify Pro/Premier tool to additionally optimize inside cores, resulting in better overall QoR.

Conclusion

An increasing percentage of FPGA designs include cores. For designers generating microprocessor and peripheral design cores using the Xilinx Embedded Development Kit (EDK), the Synplify Pro and Synplify Premier software tools offer a fully integrated flow that allows you to synthesize and implement an FPGA system.

These tools offer better QoR and a more convenient flow, as well as full design debugging and analysis during the synthesis and placement phases. For more information, contact Synplicity customer support at [email protected].


Files and Directories Created by EDK
.xmp               Xilinx microprocessor project file
.mhs               Microprocessor hardware specification file
.mss               Microprocessor software specification file
pcores/            User IP cores
hdl                HDL wrappers (encrypted cores)
etc/               UCF Xilinx constraint files
data/              Place and route option files
synthesis/         Synthesis files
implementation/    Top-level place and route results, block .ngc netlist files

Table 2 – The embedded core files generated by EDK software

Input Files Read by Synplify Pro/Premier
.ise               Project file in ISE software project directory, if created
.xmp and .mhs      Project file in EDK project directory, if created
.mpd, .pao, HDL (.v, .vhd), .edn, .ngo, .ngc
                   Design specification files in EDK project directory, if available, or in EDK hardware library in a location specified through the EDK installation path

Table 3 – These EDK files are read by Synplify Pro and Synplify Premier

Output Files Created by Synplify Pro/Premier
.prj               Project file
.sdc               Constraint file
.ucf               Xilinx user constraint file
OPT file and core wrapper file
                   Xilinx Xflow OPT file and core wrapper file generated, if the core is a black box to the Synplify directory (for example, if the core was encrypted)
.bmm               Bitmap file – in Synplify Pro/Premier place and route directory

Table 4 – Output files generated by Synplify Pro and Synplify Premier, ready for Xilinx Platform Studio to read



Xilinx maintains and updates a large database of application notes on numerous topics as well as reference systems to assist you in your next design.

PowerPC 440 System Simulation
http://www.xilinx.com/support/documentation/application_notes/xapp1003.pdf

This reference system demonstrates the functionality of the PowerPC 440 processor block on the Virtex™-5 FXT FPGA. This system includes common peripherals like Ethernet, DDR2, DMA, interrupt controller, and RS232.

Several stand-alone software applications can verify functionality of the peripherals and the PowerPC 440 processor block.

You can learn how to set up the simulation environment for the system, execute the simulation, and add PLB v4.6 peripherals. The Xilinx ML507 Rev A board verifies the system in hardware on the Virtex-5 FXT FX70TFF1136-1 FPGA.

PLBv46 PCI Using the ML555 Embedded Development Platform / ML410 Embedded Development Platform / Avnet Spartan-3 FPGA Evaluation Board
http://www.xilinx.com/support/documentation/application_notes/xapp999.pdf

http://www.xilinx.com/support/documentation/application_notes/xapp1001.pdf

http://www.xilinx.com/support/documentation/application_notes/xapp1038.pdf

These application notes describe how to build reference systems for the processor local bus peripheral component interconnect (PLBv46 PCI) core using either a PowerPC 405 or MicroBlaze™ processor-based embedded system.

A set of files containing Xilinx Microprocessor Debugger (XMD) commands is provided for writing to the configuration space headers and for verifying that the PLBv46 PCI core is operating correctly. Several software projects illustrate how to configure the PLBv46 PCI core(s), set up interrupts, scan configuration registers, and set up and use DMA operations.

PCI Bus Performance Measurements Using the Vmetro Bus Analyzer
http://www.xilinx.com/support/documentation/application_notes/xapp998.pdf

This application note illustrates how to measure performance using the Vmetro Vanguard PCI bus analyzer. These measurements use a system in which the ML410 evaluation platform, the ML555 evaluation platform, and Vmetro Vanguard-PCI (VG-PCI) boards communicate over the PCI bus. The test method is provided so that similar tests can be done using different PCI bus transactions, different boards, or other Xilinx PCI cores.

Accessing Spartan-3AN In-System Flash Using XPS SPI
http://www.xilinx.com/support/documentation/application_notes/xapp1034.pdf

The application note demonstrates how to access the in-system flash in a Spartan™-3AN FPGA after the FPGA is configured. The software applications use the serial peripheral interface (XPS SPI) core in a MicroBlaze processor-based reference system.

Introduction to Software Debugging on Xilinx PowerPC 405 Embedded Platforms / MicroBlaze Processor Embedded Platforms
http://www.xilinx.com/support/documentation/application_notes/xapp1036.pdf

http://www.xilinx.com/support/documentation/application_notes/xapp1037.pdf

These application notes offer an introduction to software debugging of Xilinx PowerPC 405 or MicroBlaze embedded processor platforms by discussing the use of the Xilinx Microprocessor Debugger (XMD) and the GNU software debugger (GDB) to debug software defects.

Ethernet PHY Register Access with GPIO
http://www.xilinx.com/support/documentation/application_notes/xapp1042.pdf

The XPS Ethernet MAC lite peripheral does not provide any mechanism to access Ethernet PHY registers. These registers are used to configure auto-negotiation parameters and to obtain PHY status. This application note provides a PowerPC 405 reference system and a MicroBlaze processor system where the PHY serial management interface signals (MDC, MDIO) are connected to an XPS GPIO peripheral.
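To illustrate what driving MDC and MDIO from GPIO involves, here is a minimal sketch in Python. It assumes the standard IEEE 802.3 Clause 22 management frame (32-bit preamble, start, opcode, PHY address, register address, turnaround, 16 data bits); it is not the code from the application note. The functions mdio_write_frame and bit_bang, and the set_mdio and set_mdc callbacks that stand in for GPIO writes, are hypothetical names.

```python
def mdio_write_frame(phy_addr, reg_addr, value):
    """Build the 64 bits of a Clause 22 MDIO write frame, MSB first."""
    if not 0 <= phy_addr < 32:
        raise ValueError("phy_addr must fit in 5 bits")
    if not 0 <= reg_addr < 32:
        raise ValueError("reg_addr must fit in 5 bits")
    if not 0 <= value < 0x10000:
        raise ValueError("value must fit in 16 bits")

    fields = [
        (0xFFFFFFFF, 32),  # preamble: 32 ones
        (0b01, 2),         # start of frame
        (0b01, 2),         # opcode: write
        (phy_addr, 5),     # PHY address
        (reg_addr, 5),     # register address
        (0b10, 2),         # turnaround
        (value, 16),       # register data
    ]
    bits = []
    for field, width in fields:
        bits.extend((field >> shift) & 1 for shift in range(width - 1, -1, -1))
    return bits


def bit_bang(bits, set_mdio, set_mdc):
    """Shift bits out through two GPIO setter callbacks (hypothetical)."""
    for bit in bits:
        set_mdio(bit)  # present data while the clock is low
        set_mdc(1)     # the PHY samples MDIO on the rising edge of MDC
        set_mdc(0)
```

For example, bit_bang(mdio_write_frame(1, 0, 0x1200), set_mdio, set_mdc) would write 0x1200 to register 0 of the PHY at address 1; the address and value are illustrative. A read frame differs in that the host releases MDIO during the turnaround so the PHY can drive the data bits.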

Setup of a MicroBlaze Processor Design for Off-Chip Trace
http://www.xilinx.com/support/documentation/application_notes/xapp1029.pdf

This application note describes how to modify an existing MicroBlaze processor design to support the trace features in MicroBlaze processor version 7 and above. It is a detailed tutorial on how to add and configure the trace core in a MicroBlaze processor design. The application note does not target any specific development board, but I/O constraints for some boards are provided to work with Lauterbach and Computex trace tools.

VxWorks 6.x on the ML403 Embedded Development Platform
http://www.xilinx.com/support/documentation/application_notes/xapp947.pdf

This guide shows the steps required to build and configure an ML403 embedded development platform to boot and run the VxWorks RTOS. A VxWorks bootloader is created, programmed into flash, and used to boot the design. The concepts presented here can be scaled to any PowerPC-enabled development platform. This popular application note is now available for EDK / ISE™ software 10.1.


Embedded Processing Application Notes and Reference Systems
Popular application notes and reference systems available from Xilinx.

TECHNICAL LITERATURE


by Jannis McReynolds
Sr. Manager, Services Marketing
Xilinx, [email protected]

More than ever, designers are being asked to reduce system costs, reduce power consumption, and meet aggressive design schedules. With QuickStart!, Xilinx helps you meet these challenges with an unprecedented level of on-site support and training.

QuickStart! is available for a wide range of technologies, including embedded systems, PlanAhead™ software floorplanning, timing optimization, multi-gigabit transceivers, and PCIe integration.

QuickStart! provides you with the accelerated ramp-up time you need to go to market faster. QuickStart! ensures that you will have a more knowledgeable team – one that is capable of executing better designs, lowering overall costs, and enhancing your competitive edge.

What is QuickStart!?

QuickStart! supplies you with an engineer at your site for one week. The Xilinx expert will train and empower your team to complete your project on time and on budget, while ensuring you make the best use of your Xilinx device.

QuickStart! deliverables include:

• A customized, two-day, on-site, technology-specific course

• Three days of consulting from a senior applications engineer

• Expert advice on design architecture and implementation optimization

Two-Day Customized Course: Embedded Systems Development

Customers engaged in an embedded design will receive the Xilinx Embedded Systems Development course.

This course gives you a better understanding of how to develop a PowerPC or MicroBlaze™ processor embedded system by using the Embedded Development Kit (EDK). The course provides hands-on labs covering the development, debugging, and simulation of the embedded system. Labs provide you with a choice of targeting either PowerPC or MicroBlaze processor systems.

Note that QuickStart! is available for projects beyond embedded processing. Please see Table 1 for details about our additional offerings.

Three Days of Consulting: Deliverables

• Configuration of the Xilinx design environment

• Setup and customization of EDK software

• Design architecture and implementation consultation and guidance

• System partitioning consultation

• Advice on using advanced features and capabilities of the Xilinx MicroBlaze processor and PowerPC processor

How You Benefit

• Shortened ramp-up times by learning essential design and debugging techniques

• Empowered hardware teams through coaching on software-based optimized systems

• Maximum performance gained from understanding advanced and proven FPGA design techniques

• Minimized design risks acquired by using the finest expert advice

Titanium Dedicated Engineering

Like QuickStart!, Titanium Dedicated Engineering provides a senior engineer on-site. However, Titanium allows for longer engagements (more than one week) and does not include a training class. For more information, visit www.xilinx.com/titanium.

Take the Next Step

Get your next Xilinx design off to the right start with QuickStart! To learn more, visit www.xilinx.com/quickstart.


Embedded Processing QuickStart!
Get your design off to the right start.

GLOBAL SERVICES

Type                 Course                              Specific Features

Embedded             Embedded Systems Development        Design environment configuration
                                                         EDK software customization
                                                         Design architecture consultation
                                                         System partitioning guidance
                                                         MicroBlaze / PowerPC processor optimization

PlanAhead Software   Designing with PlanAhead Software   PlanAhead software design environment configuration
                                                         Floorplanning implementation consultation
                                                         Synthesis and project tips
                                                         Design and performance-improving analysis

Timing Optimization  Designing for Performance           Proper constraining techniques
                                                         Best-known HDL coding methods
                                                         Timing optimization reviews
                                                         Debugging techniques demonstrations

MGT                  Designing with MGTs                 MGT ports and attributes education
                                                         Advanced features utilization
                                                         Configuration methods
                                                         Simulation and implementation flows

PCIe Integration     Designing with MGTs                 Migrating Virtex-II/Virtex-4 software designs
                                                         Advanced timing closure
                                                         Evaluating design requirements

Table 1 – QuickStart engagement types


www.xilinx.com/spartan

©2008 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.

GET UP TO 50% LOWER COST

Xilinx Spartan™-3 Generation devices are the only low-cost FPGA family to cut your total cost by half… and we can prove it! We offer the industry’s widest range of devices, most comprehensive IP library (8x the nearest competitor!), a huge selection of development boards and we’ve minimized the need for external components.

Visit us at www.xilinx.com/spartan to download our free ISE™ WebPACK™ design tools and start your Spartan-3 design today. You too can experience the lowest total cost. Period.

The World’s Most Widely Adopted Low-Cost FPGAs

PN 2074