Click here to load reader

Xcell Journal issue 75

  • View

  • Download

Embed Size (px)


Xcell Journal is the premier magazine for those interested in Field Programmable Gata Arrays (FPGAs) and FPGA-based innovation in electronics.

Text of Xcell Journal issue 75


    S O L U T I O N S F O R A P R O G R A M M A B L E W O R L D

    Xilinx FPGAs Beam Up Advanced Radio Astronomy in Australia

    Xcell journalXcell journalI S SUE 75 , S ECOND QUAR TER 2011

    High-Performance FPGAs Fly in U.K. Microsatellites

    Excerpt: FPGA-Based PrototypingMethodology Manual

    ISE 13.1 Rolls, with PlanAheadEnhancements

    High-Performance FPGAs Fly in U.K. Microsatellites

    Excerpt: FPGA-Based PrototypingMethodology Manual

    ISE 13.1 Rolls, with PlanAheadEnhancements


    Xilinx Pioneers New Class of Device in Zynq-7000 EPP Family

    Xilinx Pioneers New Class of Device in Zynq-7000 EPP Family

  • Avnet, Inc. 2011. All rights reserved. AVNET is a registered trademark of Avnet, Inc.

    Compact, Easy-to-Use KitDemonstrates the Versatilityof Spartan-6 FPGAsThe low-cost Spartan-6 FPGA LX9 MicroBoard

    is the perfect solution for designers interested

    in exploring the MicroBlaze soft processor or

    Spartan-6 FPGAs in general. The kit comes with

    several pre-built MicroBlaze systems allowing

    users to start software development just like

    any standard off-the-shelf microprocessor. The

    included Software Development Kit (SDK) provides

    a familiar Eclipse-based environment for writing and

    debugging code. Experienced FPGA users will find

    the MicroBoard a valuable tool for general-purpose

    prototyping and testing. The included peripherals

    and expansion interfaces make the kit ideal for a

    wide variety of applications.

    Xilinx Spartan-6 FPGALX9 MicroBoard Features

    tAvnet Spartan-6 FPGA LX9 MicroBoard

    tISE WebPACK software with device locked SDK and ChipScope licenses

    tMicro-USB and USB extension cables

    tPrice: $89

    To purchase this kit, visit call 800.332.8638.

  • Time to market, engineering expense, complex fabrication, and board trouble-shooting all point to a buy instead of build decision. Proven FPGA boards from theDini Group will provide you with a hardware solution that works on time, andunder budget. For eight generations of FPGAs we have created the biggest, fastest,and most versatile prototyping boards. You can see the benefits of this experience inour latest Vir tex-6 Prototyping Platform.

    We started with seven of the newest, most powerful FPGAs for 37 Million ASICGates on a single board. We hooked them up with FPGA to FPGA busses that runat 650 MHz (1.3 Gb/s in DDR mode) and made sure that 100% of the boardresources are dedicated to your application. A Marvell MV78200 with Dual ARMCPUs provides any high speed interface you might want, and after FPGA configu-ration, these 1 GHz floating point processors are available for your use.

    Stuffing options for this board are extensive. Useful configurations start below$25,000. You can spend six months plus building a board to your exact specifications,or start now with the board you need to get your design running at speed. Best ofall, you can troubleshoot your design, not the board. Buy your prototyping hardware,and we will save you time and money. 7469 Draper Avenue La Jolla, CA 92037 (858) 454-3419 e-mail: [email protected]

    DN2076K10 ASIC Prototyping Platform

    Why build your own ASIC prototyping hardware?





    12V Power

    JTAG Mictor SATA


    Seven50A Supplies


    SMAs forHigh Speed


    SODIMM Sockets

    PCIe 2.0

    PCIe to MarvellCPU

    PCIe 2.0

    USB 2.0

    USB 3.0


    10 GbEthernet


    3rd PartyDeBug


    12V Power

    All the gates and features you need are off the shelf.


  • L E T T E R F R O M T H E P U B L I S H E R

    Xilinx, Inc.2100 Logic DriveSan Jose, CA 95124-3400Phone: 408-559-7778FAX:

    2011 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands includedherein are trademarks of Xilinx, Inc. All other trade-marks are the property of their respective owners.

    The articles, information, and other materials includedin this issue are provided solely for the convenience ofour readers. Xilinx makes no warranties, express,implied, statutory, or otherwise, and accepts no liabilitywith respect to any such articles, information, or othermaterials or their use, and any use thereof is solely atthe risk of the user. Any person or entity using suchinformation in any way releases and waives any claim itmight have against Xilinx for any loss, damage, orexpense caused thereby.

    PUBLISHER Mike [email protected]

    EDITOR Jacqueline Damian

    ART DIRECTOR Scott Blair

    DESIGN/PRODUCTION Teie, Gelwicks & Associates1-800-493-5551

    ADVERTISING SALES Dan [email protected]

    INTERNATIONAL Melissa Zhang, Asia [email protected]

    Christelle Moraga, Europe/Middle East/[email protected]

    Miyuki Takegoshi, [email protected]

    REPRINT ORDERS 1-800-493-5551

    Xcell journal

    Its SpringLet the Innovation Begin!

    Its springtime! The flowers are blooming, the bees are buzzing and the IC design com-munity is readying its latest wares for the beginning of its trade show season. TheEmbedded Systems Conference and Design Automation Conference are my twofavorites, as they display the latest embedded software and hardware design tools fromvendors across the world.

    At this years ESC in San Jose (May 2-5), Xilinx will showcase its Zynq-7000 ExtensibleProcessing Platform (EPP) in Booth #1332. I fully expect this exciting new technologythesubject of this issues cover story (see page 8)to be a game-changer for the electronicsindustry. From silicon decisions to the design software infrastructure, this extremely well-thought-out architecture will immediately snatch up sockets now held by two-chipsolutions (pairing an ARM SoC with an FPGA) as well as standalone ASICs, as process tech-nology complexity, cost and thus risk simply become more prohibitive for a growingnumber of applications. With the Zynq-7000 EPP, design teams get a high-end ARM MPUtightly coupled with programmable logic on the same ICa powerful one-two punch thatsimply isnt available with other SoCs.

    In fact, I predict that with an architecture that boots the processor first, the Zynq-7000EPP will appeal to a much broader base of designers, opening up new design possibilitiesfor folks with software backgrounds, tinkerers as well as those traditionally designing andprogramming ICs. With a starting price at less than $15 (in volume) for the smallest of theserelatively immense devices, the Zynq-7000 EPP is simply too cool not to at least try.

    In addition to the excitement of the Embedded Systems Conference, I always look for-ward to the Design Automation Conference (San Diego, June 5-10), at which the small butmighty EDA industry shows its latest wares. Over the last two years, Ive served as a mem-ber of DACs Pavilion Panel organizing committee. This year Im putting together threeintriguing panels (see

    On Monday, Herb Reiter, GSA consultant for 3-D IC programs, will moderate a panel called3-D IC: Myth or Miracle? Panelist Ivo Bolsens, Xilinxs own CTO, will be discussing Xilinxsnew stacked silicon interconnect technology (see cover story, Xcell Journal Issue 74, Ivo and Herb will be Riko Radojcic from Qualcomm and TSMCs Suk Lee.

    On Tuesday, my buddy over at EE Journal, Kevin Morris, is moderating a panel calledC-to-FPGA Tools: Ready for the Mass Market? This event has a stellar panel, including JeffBier of BDTI, Mentor Graphics Simon Bloch and Synopsys Johannes Stahl. As the title sug-gests, theyll be discussing the viability and market opportunities for C-to-FPGA tool flows.

    Last but not least, venture capitalist Lucio Lanza will command a panel on Wednesdaycalled What EDA Isnt Doing Right. This will be a no-holds-barred discussion betweenLucio and a group of power EDA usersBehrooz Abdi of NetLogic Microsystems, CharlesMatar of Qualcomm and PMC Sierras Alan Nakamotoon the real pain points of designtoday and what EDA needs to do to address them and, in turn, grow the EDA market further.

    To learn more, visit and Theres a lot to look forward to as the weather warms and the trade showseason begins.

    Mike SantariniPublisher

  • 3030

    C O N T E N T S


    Letter From the PublisherIts SpringLet the

    Innovation Begin!4Xcellence in AerospaceHigh-Performance FPGAs

    Take Flight in Microsatellites14

    Xcellence in AutomotiveFast Startup for Xilinx FPGAs18

    Xcellence in High-Performance ComputingReconfigurable System Uses Large

    FPGA Computation Fields24

    Xcellence in Scientific AppsXilinx FPGAs Beam Up Next-Gen

    Radio Astronomy30

    Cover StoryZynq-7000 ExtensibleProcessing Platform SetsStage for New Era of Design



  • S E C O N D Q U A R T E R 2 0 1 1 , I S S U E 7 5

    FPGA101Area-Efficient Design of

    the SAD Function on FPGAs38

    FPGA101Demystifying FPGAs

    for Software Engineers44

    Xperts CornerExcerpt: What Can FPGA-Based

    Prototyping Do for You?50



    Xtra Xtra The latest Xilinx tool updates and patches, as of March 201160

    Xamples A mix of new and popular application notes64

    Xclamations! Share your wit and wisdom by supplying a caption for

    our techy cartoon, and you may win

    a Spartan-6 development kit66




  • 8 Xcell Journal Second Quarter 2011


    Zynq-7000 EPP Sets Stage for New Era of Innovations

    by Mike SantariniPublisher, Xcell JournalXilinx, [email protected]

  • Second Quarter 2011 Xcell Journal 9

    X ilinx has just unveiled the first devicesin a new family built around itsExtensible Processing Platform (EPP),a revolutionary architecture that mates a dualARM Cortex-A9 MPCore processor with low-power programmable logic and hardenedperipheral IP all on the same device (see coverstory, spring 2010 issue of Xcell Journal,ht tp : / /www.xi l inx .com/pub l i ca t ions /archives/xcell/Xcell71.pdf). In March of thisyear, Xilinx officially announced the first fourdevices of what it has now dubbed the Zynq-7000 EPP family.

    Implemented in 28-nanometer process tech-nology, each Zynq-7000 device is built with anARM dual-core Cortex-A9 MPCore processingsystem equipped with a NEON media engineand a double-precision floating-point unit, aswell as Level 1 and Level 2 caches, a multimem-ory controller and a slew of commonly usedperipherals (Figure 1). While FPGA vendorshave previously fielded devices with both hard-wired and soft onboard processors, the Zynq-7000 EPP is unique in that the ARM processorsystem, rather than the programmable logic,runs the show. That is, Xilinx designed the pro-cessing system to boot at power-up (before theFPGA logic) and to run a variety of operatingsystems independent of the programmable logicfabric. Designers then program the processingsystem to configure the programmable logic onan as-needed basis.

    With this approach, the software program-ming model is exactly the same as in standard,fully featured ARM processor-based systems-on-chip (SoCs). Previous implementations requireddesigners to program the FPGA logic to get theonboard processor to work. That meant you hadto be an FPGA designer to use the devices. Thisis not the case with the Zynq-7000 EPP.

    Xilinxs Zynq-7000 Extensible Processing Platform family mates a dual ARM Cortex-A9 MPCore processor-based system on the same device with programmable logic and hardened IP peripherals, offering the ultimate mix of flexibility, configurability and performance.

    C O V E R S T O R Y

  • The new product family eliminatesthe delay and risk of designing a chipfrom scratch, meaning system designteams can quickly create innovativeSoCs leveraging advanced hardwareand software programming versatilitysimply not achievable in any othersemiconductor device. As such, theZynq-7000 EPP stands poised to allowa broader number of innovatorswhether they are professional hard-ware, software or systems designers,or simply makersto explore thepossibilities of combining processingplus programmable logic to createapplications no one has yet imagined.

    At its most basic level, Zynq-7000EPP is an entirely new class of semi-conductor product, said LarryGetman, vice president of processingplatforms at Xilinx. It is not just aprocessor and it is not just an FPGA.We are combining the best of boththose worlds, and because of that wetake away many of the limitations youhave with existing solutions, especial-ly two-chip solutions and ASICs.

    Getman notes that many electronicsystems today pair an FPGA and eithera standalone processor or an ASIC withan onboard processor on the samePCB. Xilinxs new offering will allow

    companies using these types of two-chip solutions to build next-generationsystems with just one Zynq-7000 chip,saving bill-of-material costs and PCBspace, and reducing overall powerbudgets. And because the processorand FPGA are on the same fabric, theperformance increase is immense.

    Zynq-7000 EPP will also speed upthe natural market migration fromASICs to FPGAs, Getman said.Implementing ASICs in the latestprocess technologies is too expen-sive and too risky for a growing num-ber of applications. As a result, moreand more companies are embracing

    10 Xcell Journal Second Quarter 2011

    C O V E R S T O R Y



    Static Memory ControllerQuad-SPI, NAND, NOR

    ARM CoreSight Multi-core and Trace Debug

    NEON/FPU Engine

    Cortex-A9 MPCore32/32/KB I/O Caches

    512 KB L2 Cache Snoop Control Unit (SCU)

    Timer Counters 256 KB On-Chip Memory

    General Interrupt Controller DMA Configuration

    AMBA Switches


    AMBA SwitchesAMBA Switches


    Programmable Logic:

    System GatesDSP, RAM

    Processing System

    Multi Standards IOs (3.3V & High Speed 1.8V) Multi Gigabit Transceivers


    ti St





    & H

    igh S




    Cortex-A9 MPCore32/32/KB I/O Caches

    NEON/FPU Engine

    Dynamic Memory ControllerDDR2, DDR3, LPDDR2





    2x SDIOwith DMA

    2x USBwith DMA

    2x GigEwith DMA

    Figure 1 Unlike previous chips that combine MPUs in an FPGA fabric, Xilinxs new Zynq-7000 EPP family lets the ARM processor, rather than the programmable logic, run the show.

  • FPGAs. Many of those attempting tohold onto their old ASIC ways areimplementing their designs in olderprocess geometries in what analystscall value-minded SoC ASICs. AnyASIC still requires lengthy designcycles and is at risk of multiplerespinswhich can be expensive andcan delay products from going tomarket in a timely manner. WithZynq-7000 EPP on 28 nm, the pro-grammable logic portion of thedevice has no size or performancepenalty vs. older technologies, andyou also get the added benefit of ahardened 28-nm SoC in the process-ing subsystem. With a starting pricepoint below $15, we are really mak-ing it hard for companies to justifythe cost and risk of designing anyASIC that is not extremely high vol-ume, said Getman. You can get yoursoftware and hardware teams up andrunning from day one. That alonemakes it hard for engineering teamsto justify staying with ASICs.

    Getman notes that ever since Xilinxannounced the architecture last year,interest in and requests for the Zynq-7000 EPP have been remarkable. Aselect number of alpha customers arealready prototyping systems that willuse Zynq-7000 devices. The technologyis very exciting.

    SMART ARCHITECTURALDECISIONSThe Zynq-7000 EPP design team,under the direction of VidyaRajagopalan, vice president of pro-cessing solutions at Xilinx, designed avery well-thought-out architecture forthis new class of device. Above andbeyond choosing the ubiquitous and

    immensely popular ARM processorsystem, a key architectural decisionwas to extensively use the high-band-width AMBA Advanced ExtensibleInterface (AXI) interconnectbetween the processing system andthe programmable logic. This enablesmultigigabit data transfers betweenthe ARM dual-core Cortex-A9MPCore processing subsystem andthe programmable logic at very lowpower, thereby eliminating commonperformance bottlenecks for control,data, I/O and memory.

    In fact, Xilinx teamed with ARM tomake the ARM architecture an evenbetter fit for FPGA applications. AXI4has a memory-mapped version and astreaming version, said Rajagopalan.Xilinx drove the streaming definitionfor ARM because a lot of IP that peo-ple build for applications such as high-bandwidth video is streaming IP. ARMdidnt have a product that had thisstreaming interface so they partneredwith us to do it.

    Getman said another key aspect ofthe architecture is that Xilinx hard-ened a healthy mix of standard inter-face IP into Zynq-7000 EPP silicon.We tried to choose peripherals thatwere more ubiquitousthings likeUSB, Ethernet, SDIO, UART, SPI, I2Cand GPIO are all pretty standard, saidGetman. The one exception is that wealso added CAN to the device. CAN isone of the more specialized hardenedcores, but it is heavily used in two ofour key target markets: industrial andautomotive. Having it hardened in thedevice is just one more comfort factorof the Zynq-7000 EPP.

    In terms of memory, Zynq-7000devices offer up to 512 kbytes of L2

    cache that is shared by both proces-sors. The Zynq-7000 EPP devices have256 kbytes of scratchpad, which is ashared memory that the processor andFPGA can both access, said Getman.

    A unique multistandard DDR con-troller supports three types of double-data-rate memory. Where most ASSPstarget a particular segment of a market,we target LPDDR2, DDR2 and DDR3,so the user can make the trade-off ofwhether they want to go after power orperformance, said Rajagopalan. It is amultistandard DDR controller and weare one of the first companies to offer acontroller like this.

    In addition to being a new class ofdevice, Zynq-7000 EPP is also the lat-est Xilinx Targeted Design Platform,offered with base developmentboards, software, IP and documenta-tion to get customers up and runningquickly. Further, the company willroll out over the coming years verti-cal-market- and application-specificZynq-7000 EPP Targeted DesignPlatformsboards or daughtercards,IP and documentationall to helpdesign teams get products to marketfaster (see cover story, Xcell JournalIssue 68,

    Xilinx Alliance Program membersand the ARM Connected Communitywill also offer customers a wealth ofZynq-7000 EPP resources, includingpopular operating systems, debuggers,IP, reference designs and other learn-ing and development materials.

    In addition to creating great siliconand the tools to go with it, Xilinx hasmeticulously put together user-friend-ly design and programming flows forthe Zynq-7000 EPP.

    Second Quarter 2011 Xcell Journal 11

    C O V E R S T O R Y

    'With a starting price point below $15, we are really making it hard for companies to justify the cost and risk of designing any ASIC that is not extremely high volume.'

  • PROCESSOR-CENTRICDEVELOPMENT FLOWThe Zynq-7000 EPP relies on a famil-iar tool flow that allows embedded-software and hardware engineers toperform their respective develop-ment, debug and implementationtasks in much the same way as theydo nowusing familiar embedded-design methodologies already deliv-ered through the Xilinx ISE DesignSuite and third-party tools (Figure 2).

    Getman notes that software applica-tion engineers can use the same devel-opment tools they have employed forprevious designs. Xilinx provides theSoftware Development Kit (SDK), an

    Eclipse-based tool suite, for embedded-software application projects. Engineerscan also use other third-party develop-ment environments, such as the ARMDevelopment Studio 5 (DS-5), ARMRealView Development Suite (RVDS)or any other development tools from theARM ecosystem

    Linux application developers canfully leverage both the Cortex-A9 CPUcores in Zynq-7000 devices in a sym-metric-multiprocessor mode for thehighest performance. Alternatively,they can set up the CPU cores in auniprocessor or asymmetric-multi-processor mode running Linux, a real-time operating system (RTOS) like

    VxWorks or both. To jump-start soft-ware development, Xilinx providescustomers with open-source Linuxdrivers as well as bare-metal driversfor all the processing peripherals(USB, Ethernet, SDIO, UART, CAN,SPI, I2C and GPIO). Fully supportedOS/RTOS board support packages withmiddleware and application softwarewill also be available from the Xilinxand ARM partner ecosystem.

    Meanwhile, the hardware designflow is similar to the embedded-proces-sor design flow in the ISE Design Suite,with a few new steps for the ExtensibleProcessing Platform. The processingsubsystem is a complete dual-processor

    C O V E R S T O R Y

    One Processing System, Four Devices Each of the Zynq-7000 EPP familys four devices has the exact same ARM processing system, but the programmablelogic resources vary for scalability and fit different applications (see figure).

    The Cortex-A9 Multi-Processor core (MPCore) consists of two CPUseach a Cortex A9 MPCore processor withdedicated NEON coprocessor (a media and signal-processing architecture that adds instructions targeted at audio,video, 3-D graphics, image and speech processing) and a double-precision floating-point unit. The Cortex-A9 proces-sor is a high-performance, low-power, ARM macrocell with a Level 1 cache subsystem that provides full virtual-mem-ory capabilities. The processor implements the ARMv7 architecture and runs 32-bit ARM instructions, 16-bit and32-bit Thumb instructions, and 8-bit Java byte codes in Jazelle state. In addition, the processing system includes asnoop control unit, a Level 2 cache controller, on-chip SRAM, timers and counters, DMA, system control registers,device configuration and an ARM CoreSight system. For debug, it contains an embedded trace buffer, instrumen-tation trace macrocell and cross-trigger module from ARM, alongwith AXI monitor and fabric trace modules from Xilinx.

    The two larger devices, the Zynq-7030 and Zynq-7040, includehigh-speed, low-power serial connectivity with built-in multigigabittransceivers operating at up to 10.3125 Gbits/second. These devicesoffer approximately 1.9 million and 3.5 million equivalent ASIC gates(125k and 235k logic cells) respectively, along with DSP resourcesthat deliver 480 GMACs and 912 GMACs respectively of peak per-formance. The two smaller devices, the Zynq-7010 and Zynq-7020,provide roughly 430,000 and 1.3 million ASIC-gate equivalents (30kand 85k logic cells) respectively, with 58 GMACs and 158 GMACs ofpeak DSP performance.

    Each device contains a general-purpose analog-to-digital converter (XADC)interface, which features two 12-bit, 1-Msample/s ADCs, on-chip sensors andexternal analog input channels. The XADC offers enhanced functionality overthe system monitor found in previous generations of Virtex FPGAs. The two12-bit ADCs, which can sample up to 17 external-input analog channels, sup-port a diverse range of applications that need to process analog signals withbandwidths of less than 500 kHz.

    12 Xcell Journal Second Quarter 2011

    The Zynq-7000 Extensible Processing Platform debuts with a family of four devices. All sport the same ARM processing system but vary in programmable logic resources. Gate counts range from 430,000 to 3.5 million equivalent ASIC gates.

  • system with an extensive set of com-monly used peripherals. Hardwaredesigners can extend the processingpower by attaching additional soft-IPperipherals in the programmable logicto the processing subsystem. Thehardware development tool XilinxPlatform Studio automates many ofthe common hardware developmentsteps and can also assist designerswith optimized device pinouts. Wevealso added to ISE some abilities to co-debug for hardware breakpoints andcross-triggering, said Getman. Themost important thing for us was togive software developers and hard-ware designers their comfortabledesign environments.

    A MINDFUL PROGRAMMINGMETHODOLOGY In the Xilinx scheme of things, userscan configure the programmable logicand connect it to the ARM corethrough AXI interconnect blocks toextend the performance and capabili-ties of the processor system. TheXilinx and ARM partner ecosystemprovides a large set of soft AMBA inter-face IP cores for implementation in theFPGA programmable logic. Designerscan use them to build any custom func-tions their targeted applicationrequires. Because the device usesfamiliar programmable logic structuresfound in the 7 series FPGAs, designerscan load a single static programmablelogic configuration, multiple configura-tions or even employ partial reconfigu-ration techniques to allow the deviceto reprogram programmable logicfunctionality as needed on the fly.

    The interconnect operationbetween the two regions of the deviceis largely transparent to the designers.

    Access between master and slave isrouted through the AXI interconnectbased on the address range assigned toeach slave device. Multiple masterscan access multiple slaves simultane-ously, and each AXI interconnect usesa two-level arbitration scheme toresolve contention.

    GET READY, GET IN EARLYCustomers can start evaluating theZynq-7000 EPP family today by joiningthe Early Access program. First sili-con devices are scheduled for the sec-ond half of 2011, with general engi-

    neering samples available in the firsthalf of 2012. Designers can immedi-ately use tools and development kitsthat support ARM to familiarize them-selves with the Cortex-A9 MPCorearchitecture and begin porting code.

    Pricing varies and depends on vol-ume and choice of device. Based onforward volume production pricing,the Zynq-7000 EPP family will have anentry point below $15 in high vol-umes. Interested customers shouldcontact their local Xilinx representa-tive. For more information, please

    Second Quarter 2011 Xcell Journal 13

    C O V E R S T O R Y

    The most important thing for us was to give software developers and hardware designers their

    comfortable design environments.

    Software Developer


    Integrate IP



    Hardware Designer


    Integrate IP



    System Architect

    Custom IP

    Xilinx IP

    Partner IP


    Figure 2 The Zynq-7000 EPP relies on a familiar tool flow for systemarchitects, software developers and hardware designers alike.

  • 14 Xcell Journal Second Quarter 2011

    High-PerformanceFPGAs Take Flight in Microsatellites


    Artists rendition of the UKube1 CubeSat in flight

  • Second Quarter 2011 Xcell Journal 15

    The UKube1 mission is the pilot mission for the U.K. SpaceAgencys planned CubeSat program. CubeSats are a class ofnanosatellites that are scalable from the basic 1U satellite(10 x 10 x 10 cm) up to 3U (30 x 10 x 10 cm) and beyond, and whichare flown in low-earth orbit. The typical development cost of aCubeSat payload is less than $100,000, and development time isshort. This combination makes CubeSats an ideal platform for veri-fying new and exciting technologies in orbit without the associatedoverhead or risks that would be present in flying these payloads ona larger mission. Of course, this class of satellites can present itsown series of design challenges for the engineers involved.

    The EADS Astrium payload for the UKube1 mission comprisestwo experiments, both of which are FPGA based. The first experi-ment is the validation of a patent held by Astrium on random-num-ber generation. True random-number generation is an essential com-ponent of secure communications systems. The second experimentis the flight of a large, high-performance Xilinx Virtex-4 FPGAwith the aim of achieving additional in-flight experience with thistechnology while gaining an understanding of the devices radiationperformance and capabilities in the low-earth orbit (LEO). Figure 1shows the architecture of the payload.

    Utilizing XilinxVirtex-4 devices in a U.K. CubeSatmission presentssome interestingdesign challenges

    by Adam TaylorPrincipal EngineerEADS [email protected]

    X C E L L E N C E I N A E R O S P A C E

  • REQUIREMENTS AND CHALLENGESDesigning for a CubeSat mission pro-vides the engineers with many chal-lenges, not least of which is the avail-able powerthere is not much of it, at400 milliwatts on average in sunlightorbit for the UKube1. Weight restric-tions come in at a shade over 300grams for a 3U, 4.5-kg satellite.Combined with the space envelopeavailable for a payload, these limita-tions present the design team with an

    interesting set of challenges to addressif they are to develop a successful pay-load. The engineers must also addresssingle-event upsets (SEUs) and otherradiation effects, which can affect theperformance of a device in orbitregardless of the class of satellite.

    The power architecture in UKube1provides regulated 3.3-, 5- and 12-voltsupplies to each of the payloads. It ispermissible to take up to 600 mA fromeach of these rails. However, the sunlit-orbit average must be less than 400 mW.

    Unfortunately, the voltages supplied arenot at levels suitable to supply todayshigh-performance space-grade FPGAs,which typically require 1.5 V or less tosupply the core voltage. For example, theVirtex-4 space-grade device we selectedfor this mission, the XQR4VSX55 FPGA,requires a core voltage of 1.2 V alongwith supporting voltages of 2.5 and 3.3 V.The configuration PROMs needed to sup-port the FPGA require 1.8 V.

    We selected the XQR4VSX55 becauseit was the largest high-performance

    16 Xcell Journal Second Quarter 2011

    X C E L L E N C E I N A E R O S P A C E



    FPGA 3V3

    FPGA 3V3_Xilinx


    FPGA 1V5

    LinearRegulator FPGA 1V2

    LinearRegulator FPGA 2V5

    LinearRegulator PROM 1V8




    Payload I2C


    3V33V3 & 1V5 Data to be checked


    Configuration Data

    Address and Data


    Figure 1 The EADS Astrium payload for the UKube1 mission comprises two experiments, both of which are FPGA based.

  • FPGA that could be accommodatedwithin the UKube1 payload while stillachieving both the footprint and powerrequirements. The power engineer andthe FPGA engineer must give consider-able thought to the power architectureto ensure all of the power constraintsare achieved. Tools such as thePlanAhead and XPower Analyzersoftware are vital for the FPGA engi-neer to provide FPGA power budgets tothe power engineer. We chose high-effi-ciency switching regulators for this mis-sion to ensure we could achieve thecurrents required by the SX55 FPGAcould be achieved.

    The space available to implement theFPGA and its supporting functions isvery limited, with the payload being con-strained to a PC104-size printed-circuitboard. However, mezzanine cards arepermitted provided they do not exceedthe height restrictions of 35 mm.Therefore, we developed the UKube1payload to include a mezzanine card thatcontained the FPGAs, SRAM and flashmemory, while the lower board con-tained all of the power management andconversion functionality (see Figure 2).

    RADIATION EFFECTSOne of the most important aspects ofthis mission is to gain an understandingof the performance of the Virtex-4device in a LEO environment. WhileXilinx has hardened this FPGA forspaceflight, SEUs will still occur andaffect both the configuration data andthe FPGA registers, RAM and digitalclock managers (DCMs). This missiontherefore configures the FPGA to usemost of the internal logic, RAM andDCMs. We then monitor the perform-ance of this device using another one-time programmable hardened device,which passes performance statisticsback down to the ground for analysis.Interestingly, on this mission we areless concerned with SEU- and radia-tion-mitigation design techniques thanon ensuring that these events can becaptured such that they can be detect-ed by a monitoring FPGA, allowing for

    the generation of real in-flight per-formance statistics.

    This aspect of the design presentedsome interesting engineering chal-lenges. For example, an SEU couldaffect signals such as the DCM lockedsignal and incorrectly indicate that theDCM has been affected by an SEU,when in reality it has not. To counterthis potential problem, the FPGA designengineers came up with a method ofusing counters to monitor the DCMsand make sure they are locked to thecorrect frequency. Should the counterfreeze or increment at a different fre-quency, it is indicated by comparing itwith the other counters, allowing for anFPGA reconfiguration if required.

    FUTURE DEVELOPMENTS The U.K. Space Agency will launchUKube1 in January 2012. The CubeSatwill provide data on both of our exper-iments for at least the mission lifetimeof one year. This mission will demon-strate the suitability of high-perform-ance FPGAs for use in LEO missions

    and microsatellite architectures aswell as hopefully allowing the use ofhigh-performance FPGAs in other mis-sions of longer duration.

    The UKube1 is expected to be justthe first of a national CubeSat pro-gram in the U.K. The architecturedeveloped for UKube1 lends itself toadaptation for use on future missionsto gain further understanding of, andexperience in, using large, high-per-formance FPGAs in orbit. Potentialadaptations for the next missionwould include a demonstration of in-orbit partial reconfiguration and in-orbit readback and verification of thedevice configuration. Unfortunately,due to the time scales involved in thedevelopment of UKube1, these exper-iments could not be incorporated onthis maiden voyage.

    The UKube1 architecture using theXilinx Virtex-4 family of devices alsolends itself to evolution into a system-on-chip-based controller that couldserve as a micro- or nanosatellite mis-sion controller.

    Second Quarter 2011 Xcell Journal 17

    X C E L L E N C E I N A E R O S P A C E

    Figure 2 The FPGAs, SRAM and flash memory reside on a mezzanine card (top), while the power management and

    conversion functionality occupies the lower board.

  • 18 Xcell Journal Second Quarter 2011

    I n many modern applications,embedded systems have to meetextremely tight timing specifica-tions. One of these timing require-ments is the startup timethat is, thetime it takes for the electronic systemto be operative after power-up.Examples of electronic systems with atight startup-timing specification arePCI Express products or CAN-basedelectronic control units (ECUs) inautomotive applications.

    Just 100 milliseconds after poweringup a standard PCIe system, the rootcomplex of the system starts scanningthe bus in order to discover the topolo-gy and, in so doing, initiate configura-

    tion transactions. If a PCIe device is notready to respond to the configurationrequests, the root complex does notdiscover it and treats it as nonexistent.The device wont be able to participateon the PCIe bus system. [1]

    The automotive scenario is simi-lar. In CAN-based networks, ECUsgo into a sleep mode where they shutdown and disconnect from theirpower supply. Only a small bit of cir-cuitry remains alert to detect wake-up signals. Whenever a wake-upevent occurs, ECUs reconnect topower and start booting. While it isacceptable to miss any messagesduring the first 100 ms after the

    wake-up event, afterward all ECUsmust be fully operational on the net-work (for example, CAN).

    Advanced development workbetween Xilinx Automotive, XilinxResearch Labs and the KarlsruheInstitute of Technology in Karlsruhe,Germany is addressing this problemwith a two-step FPGA configurationmethodology.

    The technology trends in the semi-conductor industry have enabledFPGA manufacturers to significantlyincrease the resources in their devices.But the bitstream size grows propor-tionally, and so does the time it takesto configure the device. Therefore,

    Advanced research work demonstrates a new, two-step configuration method thatmeets the tight startup timing specs of automotive and PCIe applications.


    by Joachim MeyerPhD candidateKarlsruhe Institute of Technology [email protected]

    Juanjo NogueraSenior Research EngineerXilinx, Inc. [email protected]

    Rodney StewartAutomotive Systems ArchitectXilinx, Inc. [email protected]

    Michael HbnerDr-Ing.Karlsruhe Institute of Technology [email protected]

    Jrgen BeckerProfessor Dr-Ing.Karlsruhe Institute of Technology [email protected]

    Fast Startup for Xilinx FPGAs

  • Second Quarter 2011 Xcell Journal 19

    even with midsize FPGAs, it is not pos-sible to meet demanding startup-tim-ing specifications using low-cost con-figuration solutions. Figure 1 showsthe configuration time for differentXilinx Spartan-6 FPGAs using thelow-cost SPI/Quad-SPI configurationinterface. Even when using a fast con-figuration solution (namely, Quad-SPIrunning at a 40-MHz configurationclock), only the small FPGAs meet the100-ms startup-timing specification.For Xilinx Virtex-6 devices theresults look even more challenging,since these devices provide an abun-dance of FPGA resources.

    To address this challenge, FastStartup configures the FPGA in twosteps, rather than a single (monolithic)full-device configuration. In this novelapproach, the strategy is to load onlythe timing-critical modules at power-upusing the first high-priority bitstream,and then load the non-timing-criticalmodules afterward. This techniqueminimizes the initial configuration dataand, thus, minimizes the FPGA startuptime for the timing-critical design.

    FAST STARTUP VS. PARTIALRECONFIGURATIONFast Startup enables an FPGA designto start critical modules in the designas fast as possible, much faster thanin a standard full-configuration tech-nique. [2] Although Fast Startup fun-damentally makes use of partialreconfiguration, it differs from thetraditional concepts of this tech-nique. Partial reconfiguration intendsa full design to be used as an initialconfiguration that can be modifiedduring runtime. By contrast, FastStartup already uses an initial partialbitstream in order to configure justone specific (small) region of theFPGA at power-up. The first configu-ration contains only those parts ofthe full FPGA design that must be upand running quickly. The remainderis configured later, during runtime,using partial reconfiguration. Figure2 illustrates this sequential concept.

    TOOL FLOW OVERVIEWThe tool flow for Fast Startup relieson the design preservation flow tocreate the partial bitstreams for boththe timing-critical and the non-timing-critical subsystems.

    The design preservation flowdivides the FPGA design into logicalmodules called partitions. Partitionsform hierarchical boundaries, isolat-ing the modules inside from the othercomponents of your design. A parti-tion, once implemented (that is,

    placed and routed), can be importedby other implementation runs in orderto implement the modules of the parti-tion in exactly the same way in eachinstance. [3]

    Therefore, the first step in using theFast Startup technique is to separatethe complete FPGA design into twosections: a high-priority partition con-taining the timing-critical subsystemand a low-priority partition for the restof the components.

    IMPLEMENTING THE HIGH-PRIORITY PARTITIONTo obtain a partial bitstream for thehigh-priority partition that is as smallas possible, there are some generaldesign considerations to take intoaccount. First of all, the partitionmust include only components thatare timing-critical or that the systemneeds to perform the partial reconfig-uration of the low-priority section(for example, the ICAP). The key toobtaining a small initial partial bit-

    stream is to use an area as small aspossible to implement the high-priori-ty partition. That means you mustconstrain this partition to an appro-priate region in the FPGA.

    In order to find a good physicallocation on the FPGA, this area shouldprovide just the right amount ofresources you need for the design.Accessing resources located outsideof this area is possible but discour-agedthough generally will be

    X C E L L E N C E I N A U T O M O T I V E A P P L I C AT I O N S








    SPI x 1 cclk 2MHz

    SPI x 1 cclk 40MHz

    SPI x 4 cclk 2MHz

    SPI x 4 cclk 40MHz

    LX9 LX16 LX25T LX45T LX75T LX100T LX150T

    Figure 1 Logarithmic illustration of calculated Spartan-6 configuration times (worst-case calculation)

    Time Line:

    Device Power-Up Timing-CriticalSubsystem Running

    Non-Timing-CriticalSubsystem Configured

    P1 P1


    Figure 2 Fast Startup concept: a sequential configuration

  • inevitable in the case of I/O pins. Whenlooking for an appropriate area, alsokeep in mind that this FPGA regionmight block resources for the non-tim-ing-critical part of the FPGA design.

    When you have partitioned theFPGA and found appropriate areas forthe partitions, the next step is toimplement the high-priority partition,using an empty (black box) low-prior-ity partition. The resulting bitstreamcontains a lot of configuration framesfor resources that are not used. Youcan remove those frames in order toget a valid partial bitstream for the ini-tial configuration of the high-prioritypartition. [4]

    IMPLEMENTING THE LOW-PRIORITY PARTITIONTo create the partial bitstream for thelow-priority partition, first you have tocreate an implementation of the fullFPGA design containing both parti-tions, the high and the low priority.Import the high-priority partition fromthe previous implementation in orderto make sure it is implemented in thesame way as before.

    For Virtex-6, you will use the partialreconfiguration (PR) flow for all thedescribed implementations. As aresult, the partial bitstream for thelow-priority partition is obtained auto-matically. Since the Spartan-6 device

    family does not support the PR flow, weused the BitGen option for difference-based partial reconfiguration to obtainthe partial bitstream for the low-prioritypartition when implementing FastStartup for our Spartan-6 design. [5]Figure 3 gives a high-level overview ofthe tool flow.

    EXPERIMENTS AND RESULTSFor verifying the Fast Startup configura-tion technique in hardware, our teamimplemented it on a Virtex-6 ML605 boardas well as on a Spartan-6 SP605 board.

    The application scenario for a Virtex-6 implementation derives from thevideo domain. Whenever users power

    20 Xcell Journal Second Quarter 2011

    X C E L L E N C E I N A U T O M O T I V E A P P L I C AT I O N S







    PartialBitstreamsfor FastStartup








    Build Full Bitstream Build Partial Bitstream

    Build Partial Bitstream

    Figure 3 Fast Startup tool flow

    Low-Priority Partition





    AXI 2 AXI



    Memory High-Priority Partition

    I D




    CAN / TFTController









    Clocks for P2design



    clk TFTclk



    Figure 4 Basic block diagram of Virtex-6 and Spartan-6 demonstrator (the Virtex-6 includes the TFT block; the Spartan-6 includes the CAN block only)

  • up a video system they want to see animmediate response, and not waitseveral seconds. Therefore, in thesystem illustrated in Figure 4, a high-priority processor subsystemequipped with a TFT controller bringsup a TFT screen as quickly as possi-ble. For additional, low-priority appli-cations, the second design providesaccess to an Ethernet core, UART andhardware timer.

    For this demonstrator we used anexternal flash memory with BPI as ourconfiguration interface. As soon as theinitial high-priority bitstream config-ures the processor subsystem, soft-ware running out of BRAM initializesthe TFT controller and writes data intothe frame buffer located in the DDRmemory. This ensures the screen willshow up on the TFT as fast as possibleat startup. Afterward, the second bit-stream is read from the BPI flashmemory and the low-priority partitionis configured to enable the processorsubsystem to run other applications,such as a Web server.

    For easy expansion and a clean sep-aration of the two partitions, we usedan AXI-to-AXI bridge. This also mini-mized the nets crossing the border ofthe two design partitions. The low-pri-ority partition shares its system clockwith the high-priority partition.

    Table 1 shows the FPGA resourceutilization while Table 2 presents con-figuration times for the traditionalstartup, the startup with a com-pressed bitstream of the high-prioritypartition only [6] and the Fast Startupconfiguration technique. In all cases,the configuration interface wasBPIx16, while the applied configura-tion ratean option to determine thetarget configuration clock frequen-cywas 2 and 10 MHz. We measuredthe data by using an oscilloscope tocapture the init and the done sig-nals of the FPGA.

    The Compressed column in Table2 represents a compressed bitstreamof the high-priority partition only. Acompressed bitstream of the fullFPGA design containing both parti-tions would be 3.1 Mbytes.

    SPARTAN-6 AUTOMOTIVE ECU DESIGN To verify the Fast Startup techniquefor Spartan-6, we chose the ECUapplication scenario from the automo-tive domain. Whenever you find anFPGA in an automotive electroniccontrol unit, usually it is used by the

    Second Quarter 2011 Xcell Journal 21

    X C E L L E N C E I N A U T O M O T I V E A P P L I C AT I O N S

    Resource PartitionType High Priority % Low Priority %

    Flip-Flops 8,849 2.9 1,968 0.7

    Lookup Tables 7,039 4.7 2,197 1.5

    I/Os 135 22.5 20 3.3

    RAMB36s 34 8.2 2 0.5

    XC6VLX240T Configuration Technique

    Configuration Traditional Compressed Fast StartupInterface 8.9 MB 2.0 MB 1.4 MB

    BPIx16 CR2 1,740 ms 389 ms 278 ms

    BPIx16 CR10 450 ms 112 ms 84.4 ms

    Table 1 Occupied FPGA resources for XC6VLX240T

    Table 2 Measured configuration times for Virtex-6 video design

    VBATT(Vehicle Battery)



    Stage 1: Ultra LowCurrent Circuitry

    (always on)





    V+ V+ V+


    Ultra LowCurrent

    PSU Stage




    Main Power Supply Stage



    Figure 5 FPGA use in todays automotive ECUs and with the processor incorporated into the FPGA (dotted lines)

  • main application processing unit of theECU only (see Figure 5). Our goal wasto implement a design that also put thesystem processor inside the FPGA. Bydoing so, we could eliminate the needfor an external processor, therebyreducing overall system cost, complex-ity, space and power consumption.

    SYSTEM PARTITIONINGFor this scenario, the partitioning ofthe system is obvious. We separatedour ECU design into a system proces-sor part as our high-priority partitionand an application-processing part asthe low-priority partition.

    This design had a lot of similaritiesto the Virtex-6 design, but instead ofBPI, we used SPI as an interface tothe external flash memory andinstead of a TFT controller, a CANcontroller was necessary. Uponpower-up, the system controller hasonly a limited amount of time to bootand be ready to process the first com-munication data. For ECUs using theCAN bus for communications thisboot time limit is typically 100 ms.With traditional configuration tech-niques it is hard to beat this tight tim-ing specification using a big Spartan-6with a low-cost configuration inter-face like SPI or Quad-SPI. But using afaster and therefore more expensiveconfiguration interface is unaccept-able in the automotive domain.

    MEASUREMENT SETUPFor the SP605 automotive ECUdemonstrator we performed measure-ments in our labs; Figure 6 presents

    the measurement setup. On the leftside is an X1500 automotive platformbased on a Spartan-3 implementing atraffic generator for the CAN bus,which is able to send and receive CANmessages and measure time betweenthem using hardware timers. On theright side is the target platform, whichis not connected directly to the CANbus but uses the CAN transceiver froman additional custom board. Besidesproviding a CAN PHY, this customboard also controls the power supplyof the target board.

    The procedure to measure the con-figuration time starts with the traffic

    generator in idle status, the CAN trans-ceiver on the CAN PHY board in sleepmode and therefore the SP605 discon-nected from power. In the next step,the traffic generator starts a hardwaretimer and sends a CAN message.Recognizing the activity on the CANbus, the CAN PHY awakens and recon-nects the SP605 to the power supply.The FPGA then starts to load the initialbitstream from SPI flash.

    Because there is no receiveracknowledging the message sent bythe traffic generator, the message isresent immediately until the FPGAhas finished its configuration andalso configured the CAN core withthe valid baud rate. Whenever theCAN core of the Spartan-6 designacknowledges the message, the CANcore of the traffic generator triggersan interrupt, which stops the hard-ware timer. This timer is now holdingthe boot time for the SP605 design.Measurements, which included anadditional hardware timer inside theSP605 design, have shown that whenexecuting the software to configure

    X C E L L E N C E I N A U T O M O T I V E A P P L I C AT I O N S

    PC TrafficGeneratorCANPHY1Mb/s CAN



    12 V

    12 V



    Figure 6 Measurement setup for the automotive ECU

    Resource PartitionType High Priority % Low Priority %

    Flip-Flops 3,480 6% 1,941 4%

    Lookup Tables 3,507 13% 1,843 7%

    I/Os 58 20% 20 7%

    RAMBs 12 10% 2 2%

    Table 3 Occupied FPGA resources in Spartan-6 design

    Configuration TechniqueConfiguration Traditional Compressed Fast Startup

    Interface 1,450 KB 920 KB 314 KB

    SPIx1 CR2 5,297 ms 3,382 ms 1,157 ms

    SPIx1 CR26 292 ms 196 ms 85 ms

    SPIx2 CR2 2,671 ms 1,699 ms 596 ms

    SPIx2 CR26 161 ms 113 ms 58 ms

    SPIx4 CR2 1,348 ms 872 ms 311 ms

    SPIx4 CR26 97 ms 73 ms 45 ms

    Table 4 Measured Spartan-6 configuration times

    22 Xcell Journal Second Quarter 2011

  • the CAN core from internal BRAMmemory, the software startup timewas negligible.

    Table 3 presents the FPGA resourceconsumption for each partition. Thepercentage information refers to thetotal amount of available resources ofthe used XC6S45LXT device.

    Table 4 shows the results of theconfiguration time measurements.For these measurements, we imple-mented and compared a standard bit-stream of the full design, a com-pressed bitstream of the full designand the Fast Startup technique usinga partial initial bitstream. The tablelists the configuration times for dif-ferent SPI bus widths and differentconfiguration rate (CR) settings. Asexpected, the configuration times areproportional to the bitstream sizes.Because using a fast configurationclock does not affect the houseclean-ing process, the ratio in percentagechanges for high-CR settings.

    VERIFIED IN HARDWAREThe advanced configuration techniquethat we developed might be called pri-oritized FPGA startup, since it config-ures the device in two steps. Such atechnique is essential to address thechallenge of increasing configurationtime in modern FPGAs and to enabletheir use in many modern applica-tions, like PCI Express or CAN-basedautomotive systems.

    Besides proposing the high-priorityinitial configuration technique, we veri-fied it in hardware. We have used andtested the tool flow and methods forFast Startup to implement a CAN-basedautomotive ECU on a Spartan-6 evalua-tion board (the SP605) as well as avideo design on a Virtex-6 prototypingboard. By using this novel approach,we decreased the initial bitstream size,and hence, achieved a configurationtime improvement of up to 84 percentwhen compared with a standard full-configuration solution.

    Xilinx will support the FastStartup concept for PCI Express

    applications in the software for thenew 7 series FPGAs, with improvedimplementation techniques to simpli-fy its use. In the 7 series, the new two-stage bitstream method is the simplestand least expensive technique toimplement. The user directs theimplementation tools to create a two-stage bitstream via a simple softwareswitch when building the FPGAdesign. The first stage of the bitstreamcontains just the configuration framesnecessary to configure the timing-crit-ical modules. When configured, anFPGA STARTUP sequence occurs andthe critical blocks becomes active,thus easily satisfying the 100-ms tim-ing requirement. While the timing-crit-ical modules are working (e.g., PCIExpress enumeration/configurationsystem process is occurring), theremainder of the FPGA configurationgets loaded. The two-stage bitstreammethod can use an inexpensive flashdevice to hold the bitstreams.


    [1] PCI Express Base Specification, Rev.1.1, PCI-SIG, March 2005

    [2] M. Huebner, J. Meyer, O. Sander, L.Braun, J. Becker, J. Noguera and R.Stewart, Fast sequential FPGA startupbased on partial and dynamic reconfigu-ration, IEEE Computer Society AnnualSymposium on VLSI (ISVLSI), July 2010

    [3] Hierarchical Design MethodologyGuide, UG748, v12.1, Xilinx, May 2010

    [4] B. Sellers, J. Heiner, M. Wirthlin andJ. Kalb, Bitstream compression throughframe removal and partial reconfigura-tion, International Conference on FieldProgrammable Logic and Applications(FPL), September 2009

    [5] J. Meyer, J. Noguera, M. Huebner, L.Braun, O. Sander, R. Mateos Gil, R.Stewart, J. Becker, Fast Startup forSpartan-6 FPGAs using dynamic partialreconfiguration, Design, Automationand Test in Europe (DATE 11), 2011

    [6] Fast Configuration of PCI ExpressTechnology through PartialReconfiguration, XAPP883, v1.0, Xilinx,November 2010,

    Second Quarter 2011 Xcell Journal 23

    X C E L L E N C E I N A U T O M O T I V E A P P L I C AT I O N S

  • 24 Xcell Journal Second Quarter 2011

    Reconfigurable SystemUses Large FPGAComputation Fields


    by Igor A. KalyaevDirectorKalyaev Scientific Research Institute of Multiprocessor Computer SystemsSouthern Federal [email protected]

    Ilya I. LevinDeputy Director of ScienceKalyaev Scientific Research Institute of Multiprocessor Computer SystemsSouthern Federal [email protected]

    Evgeniy A. SemernikovHead of LaboratorySouthern Scientific Centre of theRussian Academy of [email protected]

  • Second Quarter 2011 Xcell Journal 25

    Over the last two decades, many supercom-puter architectures have included bothmicroprocessors, which serve as the mainbrains of the system, and FPGAs, which typicallyoffload some of the compute tasks from thoseCPUs. In fact, this pairing of FPGAs with standalonemicroprocessors has proven to be a winning combi-nation for many generations of supercomputers.

    But recently, our group of scientists at the KalyaevScientific Research Institute of MultiprocessorComputer Systems at Southern Federal University(Taganrog, Russia) designed a reconfigurable com-puter system (RCS) based largely on Xilinx FPGAs.By interconnecting multiple FPGAs, we can createspecial-purpose computational structures that arewell suited to solving problems in a broad numberof application areas.

    An FPGA-based RCS has several interestingdesign considerations and advantages. At its heartare several FPGAs that weve united to form a singlecomputational system. For our reconfigurable sys-tems we use an orthogonal communication system,placing FPGAs in the rectangular lattice points.Direct links interconnect adjacent FPGAs.

    In addition, we create a parallel-pipeline compu-tational structure in the FPGA computation field byusing computer-aided design tools and tools forstructural programming. The architecture of thisparallel-pipeline computational structure is similarto the task structure, and this similarity ensures thehighest performance of the RCS.

    If hardware limitations do not allow mapping ofthe entire informational graph of a task we are try-ing to perform or a problem we are attempting tosolve, then the system will automatically split thegraph into a number of subgraphs. The systems

    X C E L L E N C E I N H I G H - P E R F O R M A N C E C O M P U T I N G

    By interconnecting multiple Xilinx FPGAs, designers

    can build a new kind of supercomputer and tailor

    it for jobs in a wide range of application areas.

  • computation field sees each of thesesubgraphs as individual computa-tional structures, organizes the sub-graphs sequentially and solves themsequentially. Figure 1 illustrates thecomputational structures flow of theRCS and the procedures of datainput and output. As you can see,the size of each subgraph is propor-tional to the size of an FPGA field.Tools we have developed for struc-tural programming automaticallysplit a problem into a smaller num-ber of larger subgraphs and mapthem into hardware for efficientsolving, thus dramatically speedingRCS performance.

    In more detail, the system inter-prets the informational graph of agiven task or problem as a sequenceof isomorphic basic subgraphs, eachof which can be independent ordependent (that is, information-related) on one another. The systemtransforms each informational sub-graph into a special computationalstructure called cadr. Essentially, acadr is a hardwired task subgraphwith an operands flow runningthrough it. Each group of operands(or results) corresponds to input (oroutput) nodes of the specified sub-graph in the sequence. The specialparallel program (or procedure, aswe call it) executes the sequence ofcadrs. Only one parallel program isneeded for the whole RCS.

    We designed our RCS for high-per-formance computing in a modulararchitecture. The RCS consists ofbasic modules that host fragments ofthe computation field and have con-nections with other modules andauxiliary subsystems. Figure 2shows several basic modulesdesigned at the Kalyaev ScientificResearch Institute of MultiprocessorComputer Systems at SouthernFederal University (SRI MCS SFU).

    The FPGAs in the basic modulescommunicate via LVDS channels run-ning at a frequency of 1.2 GHz. Allmodules in an RCS communicate by

    26 Xcell Journal Second Quarter 2011

    X C E L L E N C E I N H I G H - P E R F O R M A N C E C O M P U T I N G

    Information Graph of Task


    Subgraphsafter splitting





    Basic Module 16V5-75140 Gflops, Virtex-5

    XC5VLX110-FFG1153,11*106 gates, 2008

    Basic Module 16V5-5050 Gflops, Virtex-4

    XC4VLX40-FF1148,11*106 gates, 2006

    Basic Module 16V5-125OR250 Gflops, Virtex-5

    XC5VLX110-FFG1153,11*106 gates, 2009

    Figure 2 Several basic RCS modules, all built around Virtex FPGAs

    Figure 1 Computational structures flow of the reconfigurable computer system (RCS)

  • means of these channels. Each basicmodule also includes discrete distrib-uted memory units along with a basicmodule controller to perform resourcemanagement, including loading frag-ments of parallel applications into thedistributed memory controllers andhandling data input and output.

    Our 16V5-75 is the main module inour RCS family. The RCS series con-sists of systems with various per-formance levels ranging from 50gigaflops (using 16 Xilinx Virtex-5FPGAs) to 6 teraflops (1,280 Virtex-5FGPAs). Our highest-end modelholds five ST-1R racks, each rackcontaining four 6U blocks of RCS-0.2-CB and each block consisting of fourbasic 16V5-75 modules. The peak per-formance of the basic module is up to200 gigaflops for single-precisiondata. So, the computation field of theST-1R rack contains 256 Virtex-5FPGAs (XC5VLX110) connected byhigh-speed LVDS (Figure 3).

    Our Orion RCS (Figure 4), mean-while, boasts performance of 20 ter-aflops. It consists of 1,536 Virtex-5FPGAs placed in a 19-inch rack. Themain element of the system is a basicmodule 16V5-125OR with performanceof 250 gigaflops. Basic modules areconnected into a single computationalresource via high-speed LVDS. Fourbasic 16V5-125OR modules are con-nected within a 1U block. In contrastto the basic module 16V5-75, whichhas only four of 16 FPGAs that can beconnected with other basic modules,all 16 FPGAs of the basic module16V5-125OR have LVDS for computa-tion field expandability. This provideshigh data-exchange rates between thebasic modules within the Orion block.

    Table 1 shows the real perform-ance of the block RCS-0.2-CB and

    Orion at solving various tasks: digitalsignal processing, linear algebra andmathematical physics.

    PROGRAMMING THE RCSRCS programming significantly differsfrom programming traditional cluster-

    Second Quarter 2011 Xcell Journal 27

    X C E L L E N C E I N H I G H - P E R F O R M A N C E C O M P U T I N G

    Tasks Specific performance(Gflops/dm3)

    Real performance at task solving(Gflops)




    RCS-0.2-CB 16.7 647.7 423.2 535.2

    Orion 64.9 809.6 528.7 669.0

    CF B

    M 0

    CF B

    M 1

    X1 X2

    X3 X4

    CF B

    M 2

    CF B

    M 3

    CF R



    B #0

    CF B

    M 0

    CF B

    M 1

    X1 X2

    X3 X4

    CF B

    M 2

    CF B

    M 3

    CF R



    B #2

    CF BM 3

    CF BM 2

    X3 X4

    X1 X2

    CF BM 1

    CF BM 0

    CF R



    B #1

    CF BM 3

    CF BM 2

    X3 X4

    X1 X2

    CF BM 1

    CF BM 0

    CF R



    B #3

    Computational Field of Rack ST-1R256 FPGAs: Virtex-5 XC5VLX110

    16 Basic Modules16V5-75

    Figure 3 Computation field of the ST-1R rack containing 256 Virtex-5 FPGAs connected by high-speed LVDS

    Table 1 Performance of the block RCS-0.2-CB and the computing block Orion

    We designed our RCS for high-performance computing in a modular architecture. It consists of basic modules that host fragments of the computation field

    and have connections with other modules and auxiliary subsystems.

  • based supercomputer architectures.Traditional supercomputer architec-tures have fixed hardware circuitryand can only be programmed at a soft-ware level, traditionally by softwareengineers. An FPGA-based RCS hasprogrammable hardware as well soft-

    ware. Therefore, we divide the RCSinto two kinds of programming: struc-tural and procedural. Structural pro-gramming is the task of mapping theinformational graph into the RCS struc-tures computational field (program-ming the FPGAs). Procedural program-ming is traditional programming inwhich users describe the organizationof computational processes in the sys-tem. Typically, software engineers havestruggled when it comes to program-ming FPGAs, and so to competently do

    the job they must have some hardwaredesign skills.

    Knowing this, we sought to make theRCS easier to program by developing aspecial software suite. The suite allowsusers to program the system withoutrequiring any specialized knowledge of

    FPGA and hardware design. Using thissuite, the software developer will getthe solution two to three times fasterthan an engineer who designs a circuitsolution of the task and uses tradition-al methods and tools.

    Based on the development environ-ment Fire!Constructor for circuitdesign, the suite includes a high-levelprogramming language called COLAMOand the Argus integrated developmentenvironment. Users create programs inCOLAMO and a built-in compiler/trans-

    lator/synthesizer automatically cre-ates code for both the structural andprocedural components, with the helpof an IP library of computational unitsand interfaces.

    Libraries of basic module descrip-tions, unit descriptions and a descrip-

    tion of the entire RCS (RCS passport)make it possible to port parallel applica-tions to the RCS from various architec-ture platforms. The COLAMO translatorbreaks the system description downinto four data categories: control, proce-dural, stream and structural.

    The control data is translated intoPascal and is executed in master con-trollers, which, depending on the RCShardware platform, are contained inbasic modules or computing blocks.The control data configures the FPGAs

    28 Xcell Journal Second Quarter 2011

    X C E L L E N C E I N H I G H - P E R F O R M A N C E C O M P U T I N G

    Figure 4 The 20-Tflops Orion computing block

    We sought to make the RCS easier to program by developing a special software suite. The suite allows users to program the system without

    requiring any specialized knowledge of FPGA and hardware design.

  • for the computational tasks at hand,programs distributed memory andmaps the data exchange betweenblocks and basic modules.

    Meanwhile, the procedural andstream components are translated intothe Argus assembler language. Theyare executed by distributed memorycontrollers, which specify the cadrsexecution and create parallel dataflows in the cadrs computationalstructures.

    The most interesting feature of thesuite is the synthesis tool calledFire!Constructor that we developed tohelp users program the FPGAs that lieat the heart of the RCS. Using the IPlibrary, Fire!Constructor generates aVHDL description (*.vhd file) for eachFPGA of the RCS. The Xilinx ISE

    suite uses these files to create configu-ration files (*.bit) for all the FPGAs.

    EFFECTIVE PROBLEM SOLVINGOwing to the principle of structuralprocedural calculations, we caneffectively solve various problems onthe RCS. The most suitable are theproblems that have an invariableinformation graph of the algorithmand variable sets of input data. Forinstance, problems of drug design,digital signal processing, mathemati-cal simulation and the like lend them-selves to RCS solution.

    A spin-off company called ScientificResearch Centre of Supercomputersand Neurocomputers (Taganrog,Russia) delivers these reconfigurablesystems, manufacturing each RCSaccording to preorder specifications.The company can manufacture about100 product units per month.Performance of our RCS, based onVirtex-5 and Virtex-6 FPGAs, is from100 Gflops to 10 Tflops.

    Overall, FPGA-based reconfig-urable computer systems offer uniqueadvantages for solving tasks of variousclasses. With an RCS, users can tailorcomputation to quickly tackle them.To learn more about our RCS, visit

    Second Quarter 2011 Xcell Journal 29

    X C E L L E N C E I N H I G H - P E R F O R M A N C E C O M P U T I N G

    Get on Target

    Hit your target by advertising your product or service in the Xilinx Xcell Journal, youll reach thousands of qualified engineers, designers, and engineering managers worldwide.

    The Xilinx Xcell Journal is an award-winning publication, dedicated specifically to helping programmable logic users and it works.

    We offer affordable advertising rates and a variety of advertisement sizes to meet any budget!

    Call today: (800) 493-5551 or e-mail us at

    [email protected]

    See all the new publications on our website.

    Is your marketing messagereaching the right people?

  • 30 Xcell Journal Second Quarter 2011

    As the largest and most sensitive telescope everenvisioned, the Square Kilometre Array (SKA), aradio telescope under development by a consor-tium of 20 countries, will revolutionize our knowledge ofthe universe through large-area imaging from 70 MHz to a10 GHz when it is deployed in the middle of the nextdecade. On the path to the development of the SKA, theAustralian SKA Pathfinder (ASKAP) is at a key SKA sci-ence frequency band of 700 MHz to 1.8 GHz. ASKAP seeksto, achieve instantaneous wide-area imaging through thedevelopment and deployment of phased-array feed sys-tems on parabolic reflectors. This large field of viewmakes ASKAP an unprecedented survey telescope poisedto achieve substantial advances in SKA key science.

    The Commonwealth Scientific and Industrial ResearchOrganisation, or CSIRO, is an Australian government-fund-ed research agency. A division called CSIRO Astronomyand Space Science (CASS) is undertaking construction ofthe ASKAP (see The com-bination of location, technological innovation and scientif-ic program will ensure that ASKAP is a world-leading radioastronomy facility, closely aligned with the scientific andtechnical direction of the SKA. CSIRO is also involved inthe development of the SKA, which is anticipated to beoperational in 2024.

    Currently under construction in outback WesternAustralia at the Murchison Radio Astronomy Observatory(MRO), ASKAP will consist of an array of thirty-six 12-meterdishes (Figure 1), each employing advanced phased-arrayfeeds (PAFs) to provide a wide field of view: 30 squaredegrees. As opposed to conventional antenna optics, PAFsallow for multiple configurable beam patterns using digital

    Researchers in Australia are using Virtex-6 FPGAs to economically meet the demanding requirements of an advanced telescope called ASKAP.


    by Grant Hampson, PhDASKAP Digital Systems Lead EngineerCSIRO Astronomy and Space [email protected]

    John Tuthill, PhDHardware and Firmware EngineerCSIRO Astronomy and Space [email protected]

    Andrew BrownHardware and Firmware EngineerCSIRO Astronomy and Space [email protected]

    Stephan NeuholdFirmware EngineerCSIRO Astronomy and Space [email protected]

    Timothy BatemanFirmware EngineerCSIRO Astronomy and Space [email protected]

    John Bunton, PhDASKAP Project EngineerCSIRO Information and CommunicationTechnologies [email protected]

    Xilinx FPGAs Beam Up Next-Gen Radio Astronomy

  • Second Quarter 2011 Xcell Journal 31

    electronics to implement beam-formers. ASKAPs location in theSouthern Hemisphere is advanta-geous for astronomy. Combinedwith a very large radio quiet zone(an area where radio frequencyemissions are managed), theresult will be a radio telescopewith unprecedented performance.

    ASKAPs wide field of view andbroad bandwidth will enableastronomers to survey the skyvery quickly, but present a signal-processing challenge two ordersof magnitude greater than that fac-ing existing telescopes. To satisfyperformance demands of approxi-mately 2 peta operations/secondwith 100-Tbit/s communications,Xilinx FPGAs deliver a well-bal-anced mix of these features.FPGAs provide an economicalsolution for both power consump-tion and capital cost comparedwith other alternative solutions,such as high-performance com-puting (HPC).

    STATEMENT OF THE PROBLEMThere are two major problems tosolve with ASKAP: huge com-pute capacity and very high-band-width data communications. It iseasy to find a solution to one, but notto both challenges simultaneously.The Xilinx Virtex-6 family of FPGAsprovides an excellent balance of I/Obandwidth to computational capacityand is well suited to the repetitiveparallel operations in filter banks andbeamformers.

    The ASKAP design is also con-strained by a number of other factorsthat influence the implementation.Power costs at the remote locationand the climate provide extra chal-lenges for cooling instrumentation.Radio astronomy instrumentation alsorequires an operating environmentvery low in electromagnetic interfer-ence (EMI)telescopes are sensitive

    to extremely low levels of radiation.The technological advances in Virtex-6FPGAs that reduce operating voltagesand I/O swings help reduce EMI further.A second advantage in using Virtex-6FPGAs is that the resulting lower-powersystems require less cooling, a doublesaving for an astronomy instrumentthat operates 24/7.

    SYSTEM DESCRIPTIONWe have taken the divide-and-conquerapproach to determine a cost-effec-tive on-time solution. Figure 2 illus-trates that there are two digital signal-processing locationsantenna-basedelectronics and a central processingsite. The antenna-based electronicssample and process the outputs of 188PAF receptors plus four calibration

    signals. It transports the dataover optical fiber to the cen-tral processing site.

    In the central site, beamform-ers for each antenna form up to36 dual-polarization beams onthe sky. The outputs of all theantenna beamformers then con-nect to a correlator, whichforms complex-valued visibili-ties that are used to createimages of the sky.

    Within each antenna are anumber of receiving cards thatprocess a small number of PAFreceptors or, in the case of thebeamformer, a small frequencyslice of the overall input band-width. This technique dividesthe data communications prob-lem while also reducing the costof each processing card.

    DIGITAL RECEIVER DESIGNThe digital receiver subsystemcontains hardware to samplethe intermediate-frequency(IF) signals from the PAF andfirmware to divide the IFbandwidth into 1-MHz bandsand transport the resulting

    data to the beamformer at the centralsite. The subsystem is divided intofour 3U x 160-mm Eurocard sub-racks, each consisting of 12 digitalreceiver cards, a power card and acontrol interface card. The digitalreceiver system is shown in Figure 3.

    The power card is responsible forconverting a 48-VDC input to the volt-ages required by the digital receivercards. The control card contains a 1-Gbit Ethernet port to allow local con-trol and monitoring over UDP packets.We implemented the relatively simplerequest/response control firmware ona low-cost Virtex-5 XC5VLX30T-1FF665C FPGA. Control and moni-toring links fan out from the controlcard to the digital receiver cards viathe backplane.

    X C E L L E N C E I N S C I E N T I F I C A P P L I C AT I O N S

    Figure 1 The first ASKAP reflector surface is beinginstalled in Western Australia; researchers will later

    add a phased-array feed at the focus of the antenna.

  • Each digital receiver card sam-ples four analog IF receptor signalsfrom the PAF. The card consists ofthree major sections: analog-to-digi-tal converters, a Virtex-6 LX130TFPGA and 10-Gbps optical datatransmission components. Sampling

    is performed using two dual-input 8-bit ADCs from the semiconductorcompany e2v, sampling at a rate of768 Msamples/s and providing a totalbandwidth of 384 MHz. The XilinxFPGA receives data from the ADCsas four source-synchronous 384-MHzDDR data streams. IDELAYs onthese data lines are calibrated usingan ADC test pattern.

    The ADC data is piped into a cus-tom Radix-3 384-channel polyphasefilter bank, resulting in 1-MHz-widefrequency channels. The polyphase fil-ter bank oversamples the channels ata rate of 32/27 and thus runs at a clockrate of 303.4 MHz. The oversamplingoperation of the filter bank ensuresthat we lose minimal information atthe channel edges prior to the down-

    stream fine-filter-bank operation thatfollows the beamformer.

    We chose the Xilinx Virtex-6XC6VLX130T-1FF1156C for the digitalreceiver because it had sufficient multi-pliers and memory to fit this particularsize of polyphase filter bank. There is

    some spare capacity for a number ofmonitoring functions such as his-tograms and spectral integrators. Formany radio astronomy applicationsthe ratio of RAM to DSP is typically1:1, but the relationship must be pro-portional to I/O bandwidth. The low-est-speed-grade device achieves 384-MHz performance at the lowest cost,and the footprint is compatible withfive other larger devices.

    ANTENNA TO CENTRAL-SITEDATA COMMUNICATIONSOne of the significant challenges forthe ASKAP project team is to transportand distribute the vast amounts of real-time data flowing from the PAFs on thearray of antennas through to the digitalsignal-processing chain at the central

    site. An overview of the data transportand distribution architecture for one ofthe 36 PAFs is shown in Figure 4.

    Data leaves the digital receiversVirtex-6 FPGA by means of 16 GTXtransceivers, each operating at a linerate of 3.1875 Gbps. A 159.375-MHz

    reference clock for the serializers isgenerated locally on each digitizercard, so there is no common clock ref-erence that could correlate and causeartifacts in the data. Each link carries1/16 (19 MHz) of the total processingbandwidth for the four PAF ports thedigital receiver card processes. Sinceeach beamformer card also processes1/16 of the bandwidth, each outputlink from the digitizer FPGA is des-tined for a different beamformer card.The serial output data from the digitalreceiver FPGA is packetized andaggregated into a custom XAUI (10GAttachment Unit Interface) packetstructure. Since the 16 links are unidi-rectional, point-to-point intercon-nects, there is no need for a full pack-et-switched protocol.

    32 Xcell Journal Second Quarter 2011

    X C E L L E N C E I N S C I E N T I F I C A P P L I C AT I O N S




    Each Antenna

    Optical Fiber

    Beams toCorrelator

    Central Processing Site





    Each Antenna

    Optical Fiber BeamformingEngines

    x 36

    Figure 2 Up to 8 km of fiber separates the antenna-based processing from the commoncentral processing site of the beamformers and correlators.

  • The 16 serial output links from theFPGA are grouped into four sets offour lanes, each with XAUI frameencoding. This passes to a NetlogicXAUI transceiver device onboard thedigital receiver card that reserializesthe four input streams to a 10.51875-

    Gbps 64b/66b encoded output serialstream (Fibre Channel line rate). Thefour 10G transceivers drive fourFinisar SFP+ optical transceivers.

    Each antenna has 192 (48 digitiz-ers/antenna x 4 fibers/digitizer) single-mode fibers connecting to the central

    site, for a total data rate of 2 Tbps. Thefiber spans vary from hundreds ofmeters for the closest antennas up to 8km for the farthest.

    BEAMFORMER DATA ROUTINGTwo key components of the data-rout-ing and distribution architecture arethe Vitesse 6.5-Gbps 72 x 72 cross-point switch and the industry-stan-dard Advanced TelecommunicationsComputing Architecture (ATCA)backplane. Together, these compo-nents make it possible to route indi-vidual 3.1875-Gbps serial data streamsfrom the digital receivers directly tothe appropriate beamformer card forfurther processing.

    We use discrete ROSA (receiveroptical subassembly) devices to con-vert the 10-Gbps optical data signalsfrom the antennas back to electricalsignals, then deserialize them to four3.1875-Gbps signals via the same XAUItransceiver device used on the digitalreceiver. This circuitry resides on theATCA rear transition module, witheach RTM designed to accept 12 x 10-Gbps data streams.

    The ATCA backplane supports up to16 cards, each with four duplex 3-Gbpsdata links to every other card, meaning

    X C E L L E N C E I N S C I E N T I F I C A P P L I C AT I O N S




    ATCA RTM Board ATCA Front Board





    Optical Fiber10Gbps





    Figure 3 The digital receiver card samples, filters and transmits PAF receptor data to the central processing site.

    Figure 4 Each antenna passes 2 Tbps of data to the beamformer, which distributes it using an ATCA backplane.

    Second Quarter 2011 Xcell Journal 33

  • that on a single card there are (16-1) x4 = 60 duplex links. One of the chal-lenges of using the ATCA backplane isthat the full-mesh connections are dif-ferent on each slot, so that the signalrouting must be dynamically assignedat startup depending on which slot aparticular card is located. It is herethat we use the 72 x 72 crosspointswitch to set the routing matrix ofconnections to each card.

    Once a 3.1875-Gbps stream reachesits destination beamformer card, it isrouted to a GTX receiver on one of thefour LX240T processing FPGAs. Herethe 3.1875-Gbps stream is deserializedand the data packet decoded. Sinceeach beamformer FPGA only process-es approximately 5 MHz of the total304-MHz bandwidth, three-quarters ofthe 19-MHz bandwidth arriving on the3.1875-Gbps streams is forwarded tothe other three beamformer FPGAs onthe same card. This final hop to thebeamformer subsystem also makesuse of packetized links, but in this casethe physical transport is via LVDS I/Ointerconnects between each of theDSP FPGAs.

    The reference clock for the GTXdeserializers is generated locally oneach DSP card. Hence, the data trans-port design makes use of the clock-cor-rection and elastic-buffer features with-in the Virtex-6 GTX transceiver core.

    BEAMFORMER PROCESSING The data communications systemresults in all PAF receptors for a com-mon frequency channel arriving at aparticular beamformer FPGA. Timingsignals injected in the digital receiverare used to synchronize all 192 datastreams within the beamformer.Although the main function of thebeamforming system is to form multi-ple beams, there are also a number of

    PAF support functions that go with cal-ibration and monitoring of the system.

    The beamforming operation takesthe serial streams of PAF data andbuffers them in a dual-port RAM.While data is flowing in from thecommunication links it is possible toform beams on the previously storedtime slice. A set of beamformerweights contains a list of receptors tobe included in the beamforming com-putation and an associated complexweight. DSP48 elements then multi-ply and accumulate the result foreach time slice.

    Data from each of the beamform-ers (there can be up to 36 beams inparallel) is then stored in externalDDR3 memory. Each FPGA is direct-ly connected to one stick of DDR3with an interface operating at 64 bitsx 800 Mbps.

    Before the correlator processes thebeam data, we apply a secondpolyphase filter bank to set the finalfrequency resolution of the beams.The relatively slow data input into thissecond filter bank and the large num-ber of beams necessitate a larger stor-age area. Xilinx FPGA internal blockRAM operates very differently thanDDR3 memory. BRAM has a very highbandwidth but limited depth, vs. a rel-atively low bandwidth and essentiallyunlimited depth of DDR3. Beam data

    X C E L L E N C E I N S C I E N T I F I C A P P L I C AT I O N S

    Figure 5 A 16-slot ATCA shelf holds the beamforming hardware,

    the heart of which is a Virtex-6 LX240T.

    Figure 6 Anchoring the beamforming hardware are the ATCA mesh, crosspoint switch, Virtex-6 FPGAs, DDR3 memory and SFP+ optics.

    34 Xcell Journal Second Quarter 2011

  • is written into DDR3 memory and readout in an order appropriate for the fil-ter bank and correlator.

    One of the support functions thebeamformer FPGA also performs isthe calculation of an array co-vari-ance matrix (ACM). The ACM is used

    to calibrate the array, and the calcu-lation process takes each PAF recep-tor and cross-correlates it withevery other receptor. The result isintegrated for several seconds so asto allow the result to converge andalso to reduce the output data rate.Both the beamformer and the ACMprocess data at 318.75 MHz (twicethe basic communications clock of159.375 MHz). Once the ACM is cal-culated, it is output from each beam-former card over Gigabit Ethernetusing UDP packets.

    The beamformer card (see Figure6) is a standard ATCA card (8U x 280mm) consisting of four signal-process-ing FPGAs (Virtex-6 XC6VLX240T-1FF1156C), one control FPGA (Virtex-5 LX50T), a crosspoint switch (VitesseVSC3172), four sticks of DDR3 PC3-8500 memory and eight SFP+ ports(four full duplex, four transmit only).

    DEVELOPMENT SOFTWAREThe design flow that we used for alldigital processing blocks (test signalgenerators, polyphase filters, beam-formers) centers on Xilinx SystemGenerator for MATLAB/Simulink.We wrote data communications, con-trol and monitoring, and interfaces inVHDL, using System Generator tosimulate the signal-processing blocksand ModelSim to simulate the HDL-based blocks.

    After simulation, we independentlyverified each block in hardware andthen integrated them with other com-pleted blocks to confirm that the com-plete data and processing path is operat-ing as expected. The ChipScope inte-grated logic analyzer tools are in heavy

    use during this stage to track down bugsand also for easy visualization of controland data processes on-chip.

    We stitch together the entire designby incorporating the DSP netlists fromSystem Generator (in NGC format) viaVHDL wrappers, along with all the otherHDL-based blocks. We perform synthe-sis and implementation using XST andthe ISE Project Navigator.

    The digital receiver and beamformerdesigns are very BRAM and DSP48Eintensive (see Table 1), which tends tomake it difficult to meet timing withsimple constraints. Typically we needto apply some floorplanning. The pri-mary goal of floorplanning is toimprove timing by reducing routedelay. The PlanAhead tool providesthe insight into our design that we needto be able to apply pipelining on criticalnets and guide the map and place-and-route tools to meet our timing require-ments. Usually this means optimallyplacing key sets of BRAMs in additionto block-based design area constraints.This kind of design analysis also pro-vides the ability to greatly reduceimplementation runtime.

    Table 1 provides a summary of theclock speeds for each FPGA devicetogether with the utilization statisticsfor the FPGA resources. The totalnumber of FPGAs in the digitalreceiver and beamforming systems isalso listed for reference.

    LOOKING AHEADXilinx Virtex-6 FPGAs have provento be a success for the ASKAP digitalback end, both in processing per-formance and I/O bandwidth fordata communications. The release ofthe Virtex-7 device family has our

    research team excited about evenfaster data rates, higher processingdensity and lower power consump-tion. Larger Virtex-7/Kintex-7 deviceswith enhanced I/O will significantlychange system architectures and leadto boards that are less expensive todesign and manufacture.

    With the rapid development andadoption of software-defined radio inthe commercial world, SKA prototypeimplementations leveraging this newtechnology also become a strong possi-bility as faster, lower-cost, lower-powerADCs are released in combination withXilinxs newer FPGAs. This will lead toa complete digital solution for directsampling and processing of PAF data.All of which augurs well for the mas-sive digital systems required for theultimate radio telescopethe SKA (

    AcknowledgementsThe authors are part of the DigitalSystems group, an arm of a largerteam within C