
  • Issue 61, Third Quarter 2007

    Xcell Journal
    Solutions for a Programmable World

    INSIDE

    PCI Express and FPGAs

    Reducing CPU Load for Ethernet Applications

    A High-Speed Serial Connectivity Solution with Aurora IP

    Xilinx FPGAs Adapt to Ever-Changing Broadcast Video Landscape

    Making the Most of MOST Control Messaging

    www.xilinx.com/xcell/

    Inside Look: Connectivity Illuminated

    Programmable Systems Solutions

  • Avnet Electronics Marketing designs, manufactures, sells, and supports a wide variety of hardware evaluation, development, and reference design kits for developers looking to get a quick start on a new project.

    With a focus on embedded processing, communications, and networking applications, this growing set of modular hardware kits allows users to evaluate, experiment, benchmark, prototype, test, and even deploy complete designs for field trial.

    By providing a stable hardware platform that enhances system development, design kits from Avnet Electronics Marketing help original equipment manufacturers (OEMs) bring differentiated products to market quickly and in the most cost-efficient way possible.

    For a complete listing of available boards, visit www.em.avnet.com/drc

    Support Across The Board.

    Design Kits Fuel Feature-Rich Applications

    Build your own system by mixing and matching:

    Processors

    FPGAs

    Memory

    Networking

    Audio

    Video

    Mass storage

    Bus interface

    High-speed serial interface

    Available add-ons:

    Software

    Firmware

    Drivers

    Third-party development tools

    Enabling success from the center of technology

    1 800 332 8638 | em.avnet.com

    © Avnet, Inc. 2007. All rights reserved. AVNET is a registered trademark of Avnet, Inc.

  • Innovation: Leadership above all

    Xilinx brings your biggest ideas to reality.

    Virtex-5 FPGAs: Highest Performance. With multiple platforms optimized for logic, serial connectivity, DSP, and embedded processing, Virtex-5 FPGAs lead the industry in performance and density.

    Spartan-3 Generation FPGAs: Lowest Cost. A unique balance of features and price for high-volume applications. Multiple platforms allow you to choose the lowest cost device to fit your specific needs.

    CoolRunner-II CPLDs: Lowest Power. Unbeatable for low-power and handheld applications, CoolRunner-II CPLDs deliver more for the money than any other competitive device.

    ISE Software: Ease-of-Design. With SmartCompile technology, users can achieve up to 6X faster runtimes while preserving timing and implementation. Highest performance in the fastest time: the #1 choice of designers worldwide.

    Visit our website today, and find out why Xilinx products are world renowned for leadership . . . above all.

    www.xilinx.com

    At the Heart of Innovation

    © 2007 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.

    Get started quickly with easy-to-use kits

    Order online at www.xilinx.com/getstarted

  • LETTER FROM THE PUBLISHER

    Xilinx, Inc.
    2100 Logic Drive
    San Jose, CA 95124-3400
    Phone: 408-559-7778
    FAX: 408-879-4780
    www.xilinx.com/xcell/

    © 2007 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.

    The articles, information, and other materials included in this issue are provided solely for the convenience of our readers. Xilinx makes no warranties, express, implied, statutory, or otherwise, and accepts no liability with respect to any such articles, information, or other materials or their use, and any use thereof is solely at the risk of the user. Any person or entity using such information in any way releases and waives any claim it might have against Xilinx for any loss, damage, or expense caused thereby.

    Forrest Couch

    Publisher

    PUBLISHER Forrest Couch, [email protected]

    EDITOR Charmaine Cooper Hussain

    ART DIRECTOR Scott Blair

    DESIGN/PRODUCTION Teie, Gelwicks & Associates, 1-800-493-5551

    ADVERTISING SALES Dan Teie, 1-800-493-5551

    TECHNICAL COORDINATOR Alex Goldhammer

    INTERNATIONAL Dickson Seow, Asia [email protected]

    Andrea Barnard, Europe/Middle East/[email protected]

    Yumi Homura, [email protected]

    SUBSCRIPTIONS All Inquiries: www.xcellpublications.com

    REPRINT ORDERS 1-800-493-5551

    Xcell journal

    www.xilinx.com/xcell/

    Can You Hear Me Now?

    We've all experienced it with our mobile phones: poor voice quality, dropped calls, not enough bars to connect. Although inconvenient, the problem usually goes away by moving around or waiting a few minutes before trying your call again.

    But if you're a system designer, connectivity problems launch a multi-dimensional conversation that's quite a bit more complicated to solve. Let's start with your target market. Is your product a solution for the aerospace/defense, automotive, broadcast, consumer, server/storage, industrial/scientific/medical, wired, or wireless market? What are the design challenges in that market, such as signal integrity, power management, area, and performance? In addition to these decisions, you must select the proper protocol for the unique components of your system hierarchy, such as the ubiquitous PCI Express at one end and Gigabit Ethernet at the other.

    Connectivity, in its myriad forms, is an increasingly important factor in successful system design. Whether designing chip-to-chip, board-to-board, or box-to-box, Xilinx has a connectivity solution that helps you differentiate your product within your target market and bring it to market as early as possible at the lowest cost. With our connectivity hard blocks, you get the benefits of an ASSP as well as the flexibility of an FPGA.

    This issue of Xcell Journal focuses on connectivity and brings light to some of the key issues, as well as presenting implementation examples. Additionally, Xilinx offers a variety of resources to help with your connectivity design challenges. Here are a few connectivity resources that may help.

    Connectivity Central (www.xilinx.com/products/design_resources/conn_central/)
    Xilinx provides end-to-end connectivity solutions supporting serial and parallel protocols. Design for success with Virtex and Spartan series FPGAs, IP cores, tools, and development kits.

    Virtex-5 LXT FPGA Development Kit for PCI Express (www.xilinx.com/xlnx/xebiz/designResources/ip_product_details.jsp?key=HW-V5-ML555-G)
    The Virtex-5 LXT FPGA Development Kit for PCI Express supports PCIe/PCI-X/PCI. This complete development kit passed PCI-SIG compliance for PCI Express v1.1 and enables you to rapidly create and evaluate designs using PCI Express, PCI-X, and PCI interfaces.

    Spartan-3 PCI Express Starter Kit (www.xilinx.com/onlinestore/spartan_boards.htm)
    This complete development board solution gives you instant access to the capabilities of the Spartan-3 family and the Xilinx PCI Express core.

    IP Evaluation (www.xilinx.com/ipcenter/ipevaluation/index.htm)
    Take advantage of Xilinx evaluation IP to try before you buy. Visit the Xilinx IP Evaluation Lounge and download the evaluation libraries using the links under the Related Products tab on the right side of the page. The IP on this site is covered by the Xilinx LogiCORE Evaluation License Agreement.

    On-Site Help (www.xilinx.com/xlnx/xebiz/designResources/ip_product_details.jsp?key=xgs_ss_titanium)
    For help with your connectivity designs, Titanium Technical Service provides a dedicated application engineer, either on-site or at Xilinx, who can help with system architecture optimization, tool coaching, and back-end optimization.

  • ON THE COVER

    CONNECTIVITY, page 14
    PCI Express and FPGAs: Why FPGAs are the best platform for building PCI Express endpoint devices.

    CONNECTIVITY, page 20
    Reducing CPU Load for Ethernet Applications: A TOE makes 10 Gigabit Ethernet possible.

    CONNECTIVITY, page 26
    A High-Speed Serial Connectivity Solution with Aurora IP: Aurora is a highly scalable protocol for applications requiring point-to-point connectivity.

    CONNECTIVITY, page 31
    Xilinx FPGAs Adapt to Ever-Changing Broadcast Video Landscape: New digital broadcast standards and applications are addressed by advanced silicon, software, and free reference designs available with Xilinx FPGAs.

    INTELLECTUAL PROPERTY, page 53
    Making the Most of MOST Control Messaging: This case study presents the design of Xilinx LogiCORE MOST NIC control message processing.

  • THIRD QUARTER 2007, ISSUE 61 Xcell Journal

    VIEWPOINTS

    Letter from the Publisher. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

    Selecting the Right Interconnect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

    FEATURES

    Connectivity

    Scaling Chip-to-Chip Interconnect Made Simple. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

    PCI Express and FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

    Virtex-5 FPGA Techniques for High-Performance Data Converters. . . . . . . . . . . . . . . . . . . . . . . 18

    Reducing CPU Load for Ethernet Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

    Automated MGT Serial Link Tuning Ensures Design Margins . . . . . . . . . . . . . . . . . . . . . . . . . . 23

    A High-Speed Serial Connectivity Solution with Aurora IP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

    Xilinx FPGAs Adapt to Ever-Changing Broadcast Video Landscape . . . . . . . . . . . . . . . . . . . . . . 31

    Serial RapidIO Connectivity Enhances DSP Co-Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

    The NXP/PLDA Programmable PCI Express Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

    Memory Interface

    Create Memory Interface Designs Faster with Xilinx Solutions . . . . . . . . . . . . . . . . . . . . . . . . 47

    Intellectual Property

    Driving Home Multimedia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

    Making the Most of MOST Control Messaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

    Leveraging HyperTransport on Xilinx FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

    GENERAL

    FPGA-Based Simulation for Rapid Prototyping. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

    RESOURCES

    Featured Connectivity Application Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

    Connectivity Boards and Kits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

    The Connectivity Curriculum Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

    XPECTATIONS

    Developing Technical Leaders through Knowledge Communities . . . . . . . . . . . . . . . . . . . . . . . 66

  • by Jag Bolaria, Sr. Analyst, The Linley Group, [email protected]

    Interconnects have evolved from parallel to serial and increased in complexity to enable communications with greater efficiency and less congestion or hot points. In addition to connecting endpoints, modern interconnects define comprehensive protocols for moving data efficiently across a network of endpoints.

    Thus, the networks and endpoints that must be interconnected often drive the requirements for an interconnect. These requirements include data rate, latency, lossy or lossless links, scalability, and redundancy. These requirements then drive the selection of the appropriate interconnects for a specific network.

    The existing ecosystem for an interconnect technology is another important factor in selection. A good ecosystem can help reduce development cost and time to market. In this article, I'll look at some of the leading interconnects and position those for specific applications or market segments. The Linley Group's report on High-Speed Interconnects, available at www.linleygroup.com, provides more details on various interconnects and the leading products for each.

    PCIe and Ethernet

    Our research shows that the leading interconnects are driven by large-volume platforms. The economies of scale from larger volume platforms ensure low-cost building blocks and broad availability. Additionally, large deployments lead to field-proven technologies that can be applied in other platforms with minimal risk.

    Two of the largest platforms are PCs and networking equipment. The PC platform drives PCI Express (PCIe) and Ethernet, while networking equipment drives only Ethernet. But because these interconnects were developed for specific applications, they are not a natural fit in many other markets. The semiconductor industry and system vendors are evolving these interconnects to meet requirements for new applications.

    For example, PCIe scalability has evolved to support greater data rates and more lanes. With IOV (I/O virtualization), PCIe is evolving to support virtualization, which enables its deployment in storage systems and blade servers.

    With IEEE 802.3ar and BCN (backward congestion notification), Ethernet enhancements include better flow control, congestion management, and attempts to address its inherently lossy nature. These enhancements will strengthen Ethernet applicability for storage systems, data centers, and backplanes.

    Although Ethernet and PCIe are now suitable for more applications, they still fall short in meeting the technical and business requirements for all systems. Blade servers, for example, use a combination of Ethernet and Fibre Channel (FC). Although OEMs may want to consolidate these fabrics, end users have a large investment in FC and want support for that now and in the future.

    PCIe and Ethernet also fall short in meeting the scalability, latency, and lossless requirements of high-performance computing (HPC) applications. HPC uses a specialized interconnect such as InfiniBand, which provides better latency and scalability. In this case, OEMs will need flexible interconnect solutions to enable common platforms and thus service different user requirements.


    Selecting the Right Interconnect Because interconnect characteristics vary, define your requirements clearly before selecting the appropriate interconnect.


  • Everyone Else

    Endpoints and specific system requirements often drive the development of specialized interconnects. RapidIO is one such example. Steered by system and chip vendors, RapidIO has evolved to address the unique requirements of the wireless infrastructure. It enables distributed computing on line cards and networking/wireless infrastructure systems better than most competing interconnects. RapidIO is also integrated on DSPs from Texas Instruments and PowerPC CPUs from Freescale.

    Because base stations use farms of DSPs, it is an easy decision to use RapidIO as an interconnect in these applications. Over time, we expect RapidIO to expand to other platforms that perform digital signal processing on multiple data streams.

    Examples of other specialized interconnects include XFI, SFI, XAUI, SPAUI, Interlaken, SPI-S, and KR. These interconnects were developed to address the very specific low-level requirements of each application. Although addressing all interconnects is beyond the scope of this article, let's look at a few to highlight the problems each solves and its impact in systems.

    XFI and SFI are used to connect optical modules at 10 Gbps. At these data rates, the major challenge is signal conditioning, including electronic dispersion compensation for the fiber and equalization for the board traces and connectors. These requirements drive specialized components designed specifically for the characteristics of the channel through which the signal travels.

    Because data at these rates may be channelized (that is, include multiple streams on a single physical link), it becomes important to add traffic management. Specifications such as Interlaken, SPI-S, and SPAUI address high data rates as well as traffic management. Because no single standard exists, we believe system designers need to design in solutions that provide flexibility in meeting current and future requirements.

    The combination of 10-Gbps rates on the network and multiport line cards drives the need for greater bandwidth and therefore greater data rates over the backplane. The IEEE 802.3ap addresses this with its 10GBase-KR specification, which defines 10-Gbps serial links. In addition to equalization and pre-emphasis, it may be necessary to include forward error correction for acceptable performance over a couple of connectors and up to 40-inch traces common in backplanes. Additionally, these systems may need compatibility with older line cards, driving the need for backplane operation at 1 Gbps or 3.125 Gbps. Again, a flexible solution is critical to meet the system requirements.

    Conclusion

    There are many different applications for interconnects and many interconnect choices for system designers. We expect PCIe and Ethernet to be the dominant interconnects. These will be used in servers, networking, storage systems, wireless networks, and many other systems. There is, however, no one interconnect (or two) that can meet the requirements of all systems. Therefore, the industry has developed and will continue to support specialized interconnects for different applications.

    Both dominant and specialized interconnects will continue to evolve to support greater data rates, reduced latency, and better scalability. Additionally, systems will need to support legacy cards.

    We recommend that system designers select the best interconnect and design in flexibility to cover different interconnects and the evolving changes in each. FPGAs play a critical role in offering system designers this flexibility and supporting the broad interconnect landscape.


    Xilinx Events

    North America

    July 23-27: Nuclear and Space Radiation Effects Conference 2007, Honolulu, HI

    August 7-9: NI Week 2007, Austin, TX

    September 18-20: Intel Developer Forum, San Francisco, CA

    October 29-31: Military Communications Conference 2007, Orlando, FL

    November 5-9: Software Defined Radio Forum 2007, Denver, CO

    Europe, Middle East, and Africa

    August 27-29: International Conference on Field Programmable Logic and Applications, Amsterdam, Netherlands

    September 6-11: IBC, Amsterdam, Netherlands

    Asia Pacific

    August 7-10: Agilent Digital Measurement Forum, Taiwan

    Xilinx participates in numerous trade shows and events throughout the year. This is a perfect opportunity to meet our silicon and software experts, ask questions, see demonstrations of new products and technologies, and hear other customers' success stories with Xilinx products.

    For more information and the most up-to-date schedule, visit

    www.xilinx.com/events.

  • www.agilent.com/find/logic-offer

    © Agilent Technologies, Inc. 2006. Windows is a U.S. registered trademark of Microsoft Corporation.

    Logic analyzers up to 50% off. A special limited-time offer from Agilent.

    Now you can see inside your FPGA designs in a way that will save weeks of development time.

    The FPGA dynamic probe, when combined with an Agilent Windows-based logic analyzer, allows you to access different groups of signals inside your FPGA for debug without requiring design changes. You'll increase visibility into internal FPGA activity by gaining access to up to 128 internal signals with each debug pin.

    Our 16800 Series logic analyzers offer you unprecedented price-performance in a portable family, with up to 204 channels, 32 M memory depth and a pattern generator available.

    And now for a limited time, you can receive up to 50% off our newest 16901A modular logic analyzer mainframe when you purchase eligible measurement modules. Offer valid February 1, 2007 through August 15, 2007. Reference promotion 5.564.

    Increased visibility with FPGA dynamic probe

    Customized protocol analysis with Agilent's exclusive packet viewer software

    Low-cost embedded PCI Express packet analysis

    Pricing starts at $9,450

    Agilent portable and modular logic analyzers

  • by Farhad Shafai, Vice President, R&D, Sarance Technologies, [email protected]

    Kelvin Spencer, Senior Design Engineer, Sarance Technologies, [email protected]

    As the world gets connected, the demand for bandwidth continues to increase. The interconnect technology for communication systems must not only connect devices today, but also provide a roadmap for the future. Traditional solutions, such as XAUI or SPI-4.2, cannot scale beyond 10 Gbps. SPI-4.2 uses a low-speed parallel bus, which requires a large number of pins to transfer 10 Gbps of data. XAUI does not have any provisions for channelizing the packet stream, making it unsuitable for applications that differentiate between packets. Several attempts have been made to build on XAUI and SPI-4.2; all derivatives, however, suffer from the inherent limitations of the solutions on which they are based, and are therefore not optimal.

    Interlaken is a new chip-to-chip, channelized packet interface protocol developed by Cisco Systems and Cortina Systems. It is based on SERDES technology and provides a framework for an efficient and robust chip-to-chip packet interface that easily scales from 10 Gbps to 50 Gbps and beyond. Several applications for Interlaken are shown in Figure 1. Simply put, you can use Interlaken as the connectivity solution for all devices in a typical network communication line card. The devices include front-end aggregation ICs or FPGAs, network processors, and traffic managers. Additionally, you can use translation FPGAs to bridge between legacy and modern devices that have Interlaken interfaces.

    Sarance Technologies has developed a suite of Interlaken IP cores (IIPC) targeted at Xilinx Virtex-5 FPGAs. The IIPC family of cores is a highly optimized implementation of Interlaken that takes advantage of the advanced features of the Virtex-5 device. The IIPC abstracts all of the details of Interlaken and provides a very simple and straightforward interface. All members of the IIPC family use the same user-side protocol and programming interface, greatly simplifying performance scaling: using a 10-Gbps IIPC core is no different than using a 50-Gbps IIPC.


    Scaling Chip-to-Chip Interconnect Made Simple
    Sarance's high-performance Interlaken IP cores connect devices at up to 50 Gbps.

    Figure 1 - Typical applications: Interlaken connects the devices on a network line card (port aggregation ICs, NPU FPGAs and ICs, translation FPGAs, and traffic manager ICs/FPGAs), while SPI-4.2 links remain for legacy devices.


  • Interlaken Basics

    Interlaken is a narrow, high-speed channelized chip-to-chip interface. For simplicity's sake, we will not discuss the finer details of the protocol. At a high level, the basic concepts of the protocol include:

    Support for 256 logic channels

    Data scrambling and 64B/67B data encoding to ensure proper DC balance

    Segmentation of data into bursts delineated by control words

    CRC24 protection for each data burst

    CRC32 protection for each lane

    Protocol independence from the number of SERDES lanes and SERDES rate

    Support for in-band and out-of-band flow-control mechanisms

    Lane diagnostics and lane decommissioning

    A typical chip-to-chip implementation is shown in Figure 2. The packet data is striped across any number of high-speed serial lanes by the transmitting device and then reassembled by the receiving device. The protocol is independent from the number of SERDES lanes and the SERDES rate, which makes the performance proportional to the number of SERDES lanes. As an example, consider a 10-Gbps system. Using four multi-gigabit transceivers (MGTs) running at 3.125 Gbps, we can build an interface with a total raw bandwidth of 12.5 Gbps that has enough headroom for protocol overhead and can transmit 10 Gbps of real payload. Scaling the interface up to 20G simply requires doubling the number of MGTs to eight; the bandwidth scales accordingly.
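    The scaling arithmetic is easy to sanity-check. The short Python sketch below (not from the article) estimates raw and approximate payload bandwidth from the lane count and SERDES rate, assuming only the 64B/67B line-code overhead listed above plus a nominal allowance for control words and framing; the framing_efficiency value is an assumption for illustration, not a figure from the Interlaken specification.

        # Rough Interlaken bandwidth estimate (illustrative assumptions only).
        def interlaken_estimate(lanes, serdes_gbps, framing_efficiency=0.94):
            """Return (raw, approximate payload) bandwidth in Gbps.

            framing_efficiency is an assumed allowance for control words,
            CRC, and idles; it is not taken from the specification.
            """
            raw = lanes * serdes_gbps
            after_64b67b = raw * 64.0 / 67.0      # 64B/67B line-code overhead
            payload = after_64b67b * framing_efficiency
            return raw, payload

        for lanes in (4, 8, 16):                  # the configurations in Table 1
            raw, payload = interlaken_estimate(lanes, 3.125)
            print(f"{lanes:2d} lanes @ 3.125 Gbps: raw {raw:5.1f} Gbps, "
                  f"~{payload:4.1f} Gbps payload")

    With four lanes the estimate comfortably clears the 10 Gbps of real payload quoted above, and the figures double with each doubling of the lane count, which is the headroom argument in a nutshell.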

    Sarance's IIPC Family

    The IIPC is a highly optimized implementation of Interlaken and, remaining true to the intent of Interlaken, is architected to offer the same flexibility and scalability offered by the protocol. The IIPC can stripe and reassemble data across any number of SERDES lanes running at any SERDES rate and has support for as many as 256 channels. It is fully compliant with revision 1.1 of the Interlaken specification document and is hardware-proven to be interoperable with several ASIC implementations of Interlaken.

    Table 1 lists the device utilization and implementation details for three different IIPC cores targeted to Virtex-5 LXT FPGAs.

    Figure 3 shows the block diagram of the IIPC. At the time of this writing, the IIPC is capable of supporting as much as 50 Gbps of raw bandwidth. The IIPC is divided into two major functional partitions:

    Lane logic that is replicated for each SERDES lane

    Striping and reassembly logic and the user-side interface

    Lane-logic circuitry is the dominant portion of the overall utilization of the IIPC, since it is replicated for each lane.


    Figure 2 - A typical chip-to-chip implementation (only the unidirectional link is shown): the Interlaken transmitter in Device A stripes the incoming packet stream across lanes 0 through n, and the Interlaken receiver in Device B reassembles the data from the n lanes into the outgoing packet stream.

    Figure 3 - Block diagram (SERDES is on the left; the user-side interface is on the right): each lane block contains the CRC32, scrambler/descrambler, and gearbox; the shared logic handles striping, reassembly, CRC24, protocol management, and the user-side interface (clkin/clkout, datain/dataout[255|127|63:0], mtyin/mtyout[4|3|2:0], chanin/chanout[7:0], the enain/sopin/eopin/errin and enaout/sopout/eopout/errout strobes, rdyout, and a configuration and status bus).

    Table 1 - Configuration and Virtex-5 LXT resource utilization

    Bandwidth   MGT Lanes   MGT Rate     Logic LUTs   Block RAMs
    12.5 Gbps   4           3.125 Gbps   9,000        5
    25 Gbps     8           3.125 Gbps   18,000       5
    50 Gbps     16          3.125 Gbps   32,500       5


    Each lane has a gearbox, CRC32, and a descrambler/scrambler module. All of these functions have traditionally been very expensive in FPGA technology. Our implementation of these functions, however, takes full advantage of the Virtex-5 device's six-input LUTs in a way that makes the circuits very compact and efficient.

    The striping and reassembly logic performs the required MUXing to transfer data between the user-side interface and the lane circuitry and handles link-level functions. Although the striping and reassembly logic is relatively smaller in terms of area, it has to process the total bandwidth of the interface and therefore is the most timing-critical part.

    Specifically, the CRC24 function has to potentially process 50 Gbps of data. Again, we have made the most of the FPGA's hardware features to come up with a very efficient and high-performance implementation of the CRC24 function.

    User-Side Interface

    To help ease the integration of the IIPC with other logic in the FPGA, we have implemented a very simple and straightforward user-side interface. The bus protocol for transferring packet data to and from the IIPC is similar to the familiar SPI-like bus protocols that are commonly used in the industry. The configuration interface comprises a set of configuration input signals and another set of status output signals that can be easily connected to any processor interface. The status signals monitor the status of the link and identify possible configuration or transmission errors.

    One key feature of our user-side interface is that it can be set to be identical for the entire IIPC family. This feature allows you to implement the same user-side logic in all of your designs, independent of the configuration or bandwidth of the Interlaken interface. You can build a 10G design today knowing that your configuration software and FPGA architecture do not change when the design is scaled to 20G, 40G, and beyond. Even if you decide to change the SERDES rate or number of lanes, the user-side interface is still not affected. The IIPC provides you with a solution for today and for the future in a single, highly optimized package.

    Ease of Use

    Using the IIPC is as simple as powering up the FPGA, setting the configuration registers, resetting the core, and waiting for the core to signal that it is ready. The IIPC will automatically communicate with the other device and, when link integrity is established, will set a status signal. All you have to do is monitor the status signal and start sending packets when it is asserted.

    The IIPC handles all of the details of Interlaken, including automatic word and lane alignment and automatic scrambler/descrambler synchronization. In addition, the IIPC performs full protocol checking and error handling. It recovers from all error conditions (any number of bit errors are properly detected and appropriately handled) and will never violate the user-side protocol.

    Conclusion

    Interlaken is the future of chip-to-chip packet interfaces. It combines the benefits of the latest SERDES technology and a simple yet robust protocol layer to define a flexible and scalable interconnect technology. Sarance Technologies' IIPC is an optimized implementation of the Interlaken revision 1.1 specification targeted for the Virtex-5 FPGA.

    Our core is hardware-proven to interoperate with several ASIC implementations of Interlaken and can support up to 50 Gbps of raw bandwidth. Our roadmap has us increasing the bandwidth to 120 Gbps in the very near future. For the latest updates and information about our interactive demonstration platform, e-mail [email protected]


  • by Alex Goldhammer, Technical Marketing Manager, Platform Solutions, Xilinx, Inc., [email protected]

    PCI Express is a high-speed serial I/O interconnect scheme that employs a clock data recovery (CDR) technique. The PCI Express Gen1 specification defines a line rate of 2.5 Gbps per lane, allowing you to build applications that have a throughput of 2 Gbps (after 8B/10B encoding) for a single-lane (x1) link to 64 Gbps for 32 lanes. This allows a significant reduction in pin count while maintaining or improving throughput. It also reduces the size of the PCB and the number of traces and layers, and simplifies layout and design. Fewer pins also translate to reduced noise and electromagnetic interference (EMI). CDR eliminates the clock-to-data skew problem prevalent in wide parallel buses, making interconnect implementations easier.
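    The throughput figures above follow directly from the line rate and the 8B/10B code. The small Python sketch below just makes that arithmetic explicit for the common lane widths (a minimal illustration, not Xilinx code):

        # PCI Express Gen1: 2.5 Gbps line rate, 8B/10B encoding (8 data bits
        # carried per 10 line bits), scaled linearly by lane count.
        LINE_RATE_GBPS = 2.5
        ENCODING_EFFICIENCY = 8.0 / 10.0

        for lanes in (1, 2, 4, 8, 16, 32):
            data_gbps = lanes * LINE_RATE_GBPS * ENCODING_EFFICIENCY
            print(f"x{lanes:<2d}: {data_gbps:4.0f} Gbps per direction")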

    The PCI Express interconnect architecture is primarily specified for PC-based (desktop/laptop) systems. But just like PCI, PCI Express is also quickly moving into other system types, such as embedded systems. It defines three types of devices: root complex, switch, and endpoint (Figure 1). The CPU, system memory, and graphics controller connect to a root complex, which is roughly equivalent to a PCI host. Because of the point-to-point nature of PCI Express, switch devices are necessary to expand the number of system functions. PCI Express switch devices connect a root complex device on the upstream side to endpoints on the downstream side.

    Endpoint functionality is similar to a PCI/PCI-X device. Some of the most common endpoint devices are Ethernet controllers or storage HBAs (host-bus adapters). Because FPGAs are most frequently used for data processing and bridging functions, the largest target function for FPGAs is endpoints. FPGA implementations are ideally suited for video, medical imaging, industrial, test and measurement, data acquisition, and storage applications.

    PCI Express and FPGAs
    Why FPGAs are the best platform for building PCI Express endpoint devices.

  • The PCI Express specification maintained by the PCI-SIG mandates that every PCI Express device use three distinct protocol layers: physical, data-link, and transaction. You can build a PCI Express endpoint using a single- or two-chip solution. For example, you can use a low-cost FPGA such as a Xilinx Spartan-3 device for building data-link and transaction layers with a commercially available discrete PCI Express PHY (Figure 2). This option is best suited for x1 lane applications such as bus controllers, data-acquisition cards, and performance-boosting PCI 32/33 devices. Or you can use a single-chip solution such as Virtex-5 LXT or SXT FPGAs, which have an integrated PCI Express PHY. This option is best for communications or high-definition audio/video endpoint devices (Figure 3) that require the higher performance of x4 (8-Gbps throughput) or x8 (16-Gbps throughput) links.

    Before selecting a technology for implementing a PCI Express design, you must carefully consider the choice of IP, link efficiency, compliance testing, and availability of resources for the application. In this article, I'll review the factors for building single-chip x4- and x8-lane PCI Express designs with the latest FPGA technology.

    Choice of IP

    As a designer, you can choose to build your own soft IP or buy IP from either a third party or an FPGA vendor. The challenge of building your own IP is that not only do you have to create the design from scratch, you also have to worry about verification, validation, compliance, and hardware evaluation. IP purchased from a third party or FPGA vendor will have gone through all of the rigors of compliance testing and hardware evaluation, making it plug and play. When working with a commercially available, proven, compliant PCI Express interface, you can focus on the most value-added part of the design: the user application. The challenge of using soft IP is the availability of resources for the application. As the PCI Express MAC, data-link, and transaction layers in soft IP cores are implemented using programmable fabric, you must pay special attention to the number of remaining block RAMs, look-up tables, and fabric resources.


    Figure 1 - PCI Express topology: the CPU, system memory, and x16 PCI Express graphics connect to the root complex; switches fan out to x1, x4, and x8 endpoints and to a PCI bridge for legacy PCI.

    Figure 2 - Spartan-3 FPGA-based data-acquisition card: the FPGA connects to a discrete PHY over the PIPE interface to reach a x1 PCI Express connector on the PC motherboard, and to a control processor and sensors over a proprietary LVDS interface.

    Figure 3 - Virtex-5 LXT FPGA-based video application: an FPGA with HD-SDI input, a x8 PCI Express link, DDR2 memory, and a CPU.


  • Another option is to use an FPGA with the latest technology. The Virtex-5 LXT and SXT have an integrated x8-lane PCI Express controller implemented in dedicated gates (Figure 4). This type of implementation is very advantageous, as it requires a minimal number of FPGA logic resources because the design is implemented in hard silicon. For example, in the Virtex-5 LXT FPGA, an x8-lane soft IP core can consume up to 10,000 logic cells, while the hard implementation will need about 500 logic cells, mostly for interfacing. This resource savings sometimes allows you to choose a smaller device, which is generally cheaper. Integrated implementations also typically have higher performance and wider data paths, and are software-configurable.

    Another challenge with soft IP implementations is the number of features. Typically, such cores only implement the minimum features required by the specification to meet performance or compliance goals. Hard IP, on the other hand, can support a comprehensive feature list based on customer demand and full compliance (Table 1). There are no major performance or resource-related issues.

    Latency

    Although the latency of a PCI Express controller will not have a huge impact on overall system latency, it does affect the performance of the interface. Using a narrower data path helps latency.

    For PCI Express, latency is the number of cycles it takes to transmit a packet and receive that packet across the physical, logical, and transaction layers. A typical x8-lane PCI Express endpoint will have a latency of 20-25 cycles. At 250 MHz, that translates into 80-100 ns. If the interface is implemented with a 128-bit data path to make timing easier (such as 125 MHz), the latency doubles to 160-200 ns. Both the soft and hard IP implementations in the latest Virtex-5 LXT and SXT devices implement a 64-bit data path at 250 MHz for x8 implementations.
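    The cycle-to-nanosecond conversion is simple enough to verify in a couple of lines of Python (an illustration of the arithmetic above, not a timing model):

        # Endpoint pipeline latency: cycles / clock frequency.
        def latency_ns(cycles, clock_mhz):
            return cycles * 1000.0 / clock_mhz

        for clock_mhz, width_bits in ((250, 64), (125, 128)):
            lo, hi = latency_ns(20, clock_mhz), latency_ns(25, clock_mhz)
            print(f"{width_bits}-bit datapath @ {clock_mhz} MHz: {lo:.0f}-{hi:.0f} ns")

    The same 20-25 pipeline cycles cost twice as much wall-clock time at 125 MHz as at 250 MHz, which is why the narrower, faster data path helps latency.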

    Link Efficiency

    Link efficiency is a function of latency, user application design, payload size, and overhead. As payload size (commonly referred to as maximum payload size) increases, the effective link efficiency also increases.

    Figure 4 - Virtex-5 LXT FPGA PCI Express endpoint block diagram: the integrated PCI Express block (configuration and capabilities module, transaction layer, data-link layer, and physical layer) connects to the fabric on one side and to eight RocketIO GTP transceivers on the other.

    Table 1 - Virtex-5 LXT FPGA PCI Express capabilities

    Performance
    Lane Width   Interface Data Width   Interface Speed     Bandwidth (each direction)
    x1           64                     62.5/125/250 MHz    2 Gbps
    x2           64                     62.5/125/250 MHz    4 Gbps
    x4           64                     125/250 MHz         8 Gbps
    x8           64                     250 MHz             16 Gbps

    PCI Express Specification v1.1 Compliance
    Requirement                            Support
    Clock Tolerance (300 ppm)              Y
    Spread-Spectrum Clocking               Y
    Electrical Idle Generate and Detect    Y
    Hot Plug                               Y
    De-Emphasis                            Y
    Jitter Specifications                  Y
    CRC                                    Y
    Automatic Retry                        Y
    QOS                                    2 VC/round robin, weighted round robin, or strict priority
    MPS                                    128-4096 bytes
    BARs                                   Configurable 6 x 32 bit or 3 x 64 bit for memory or I/O
    Required Power Management States       Y


    This is caused by the fact that the packet overhead is fixed; if the payload is large, the efficiency goes up. Normally, a payload of 256 bytes can get you a theoretical efficiency of 93% (256 payload bytes + 12 header bytes + 8 framing bytes). Although PCI Express allows packet sizes up to 4 KB, most systems will not see improved performance with payload sizes larger than 256 or 512 bytes. A x4 or x8 PCI Express implementation in the Virtex-5 LXT FPGA will have a link efficiency of 88-89% because of link protocol overhead (ACK/NAK, re-transmitted packets) and flow control protocol (credit reporting).
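    The 93% figure is simply the ratio of payload to payload-plus-overhead. The short Python sketch below computes that theoretical per-packet efficiency for a few payload sizes, using the 12 header bytes and 8 framing bytes cited above; the 88-89% seen on real links is lower again because of ACK/NAK, retransmissions, and flow-control credit traffic.

        # Theoretical PCIe packet efficiency: payload / (payload + overhead),
        # with the 12-byte header and 8 bytes of framing quoted in the text.
        HEADER_BYTES = 12
        FRAMING_BYTES = 8

        def packet_efficiency(payload_bytes):
            return payload_bytes / (payload_bytes + HEADER_BYTES + FRAMING_BYTES)

        for payload in (64, 128, 256, 512, 1024):
            print(f"{payload:4d}-byte payload: {packet_efficiency(payload):.1%}")

    At 256 bytes this works out to about 92.8%, and the curve flattens quickly beyond 512 bytes, which is why larger maximum payload sizes buy little in practice.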

    Using FPGAs for implementation gives you better control over link efficiency because it allows you to choose the receive buffer size that corresponds to the endpoint implementation. If both link partners do not implement the data path in a similar way, the internal latencies on both will be different. For example, if link partner #1 uses the 64-bit, 250-MHz implementation with a latency of 80 ns and link partner #2 uses the 128-bit, 125-MHz implementation with a latency of 160 ns, the combined latency for the link will be 240 ns. Now, if link partner #1's receive buffer was designed for a latency of 160 ns, expecting that the link partner would also be a 64-bit, 250-MHz implementation, then the link efficiency would go down. With an ASIC implementation, it would be impossible to change the size of the receive buffer, and the loss of efficiency would be real and permanent.

    User application design will also have an impact on link efficiency. The user application must be designed so that it drains the receive buffer of the PCI Express interface regularly and keeps the transmit buffer full all the time. If the user application does not use packets received right away (or does not respond to transmit requests immediately), the overall link efficiency will be affected regardless of the performance of the interface.

    When designing with some processors, you will need to implement a DMA controller if the processors cannot perform bursts longer than 1 DWORD. This translates into poor link utilization and efficiency. Most embedded CPUs can transmit bursts longer than 1 DWORD, so the link efficiency for such designs can be effectively managed with a good FIFO design.

    PCI Express Compliance

    Compliance is an important detail that is frequently missed and often undervalued. If you are building PCI Express applications that must work with other devices and applications, ensuring that your design is compliant is a must.

    The compliance is not just for the IP but for the entire solution, including the IP, user application, silicon device, and hardware board (Figure 5). If the entire solution has been validated at a PCI-SIG PCI Express compliance workshop (also known as a "plug fest"), it is pretty much guaranteed that the PCI Express portion of your design will always work.

    Conclusion

    Replacing PCI, PCI Express has become the de facto system interconnect standard and has jumped from the PC into other system markets, including embedded system design. FPGAs are ideally suited to build PCI Express endpoint devices, as they allow you to create compliant PCI Express devices with the added customization that embedded users desire.

    New 65-nm FPGAs like the Virtex-5 LXT and SXT families are fully compliant with the PCI Express specification v1.1 and offer an abundance of logic and device resources to the user application. The Spartan-3 family of FPGAs with an external PHY offers a low-cost solution. These factors, combined with the inherent programmable logic advantages of flexibility, reprogrammability, and risk reduction, make FPGAs the best platforms for PCI Express.

    Figure 5 Virtex-5 LXT FPGA PCI Express compliance workshop results

    TAKE THE NEXT STEP (Digital Edition: www.xcellpublications.com/subscribe/)

    Buy the Development Kit for PCI Express: Virtex-5 FPGA edition or Spartan-3 FPGA edition

    Download the Protocol Pack for PCI Express

    Watch our live demo from the Embedded Systems Conference

    Register for a PCIe class


  • by Luc Langlois, Global Technical Marketing Manager, DSP, Avnet, [email protected]

    The incessant demand for higher bandwidths and resolutions in communication, video, and instrumentation systems has propelled the development of high-performance mixed-signal data converters in recent years. This poses a challenge to system designers seeking to preserve the exceptional signal-to-noise specifications of these devices in the signal processing chain. Xilinx Virtex-5 FPGAs provide extensive resources for high-performance mixed-signal systems, supported by efficient development tools spanning all phases of design, from system-level exploration to final implementation.

    Key Specifications of Data Converters

    A typical mixed-signal processing chain starts at the analog-to-digital converter (ADC). Modern high-performance ADCs provide sampling rates extending into the hundreds of megasamples per second (MSPS) for 12- and 14-bit devices. For example, the Texas Instruments ADS5463 ADC provides 12 bits at 500 MSPS, with 64.5 dB full-scale (dBFS) of signal-to-noise ratio (SNR) to 500 MHz.

    Fast sampling rates offer several benefits, including the ability to digitize wideband signals, reduced complexity of anti-alias filters, and lower noise power spectral density. The result is improved SNR in the system. Your challenge is to implement the high-speed interface between the data converter and FPGA while preserving the SNR throughout the signal processing chain in the FPGA.

    Before the digital ADC data is captured in the FPGA, you must take careful precautions to minimize jitter on the data converter sampling clock. Jitter degrades SNR depending on the signal bandwidth of interest. For example, preserving 74 dB of SNR - or approximately 12 effective number of bits (ENOB) - for signal bandwidths extending to 100 MHz requires a maximum 300 fs (femtoseconds) of clock jitter. Modern ADCs provide clever interfaces that simplify distribution of clean low-jitter clocks on the board. Let's examine how key features of Virtex-5 FPGAs are used to implement these interfaces.
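    The 300-fs budget follows from the standard aperture-jitter limit, which caps achievable SNR at -20*log10(2*pi*f_in*t_j) regardless of converter resolution. A quick Python check of the numbers quoted above (illustrative only):

        import math

        # Jitter-limited SNR in dB: -20*log10(2*pi*f_in*t_jitter).
        def snr_jitter_db(f_in_hz, t_jitter_s):
            return -20.0 * math.log10(2.0 * math.pi * f_in_hz * t_jitter_s)

        snr = snr_jitter_db(100e6, 300e-15)      # 100-MHz signal, 300 fs jitter
        enob = (snr - 1.76) / 6.02               # standard SNR-to-ENOB conversion
        print(f"{snr:.1f} dB, about {enob:.1f} ENOB")   # ~74.5 dB, ~12 bits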

    High-Performance ADC Interface

    High-performance ADC sampling rates often exceed the minimum rate necessary to avoid aliasing, or Nyquist rate, defined as twice the highest frequency component in the analog input signal. The highly oversampled digital signal entering the FPGA need not maintain a fast sampling rate throughout the signal processing chain; it can be decimated with negligible distortion in the digital domain by a high-quality decimation filter. This offers the benefits of a slower system clock in subsequent processing stages for easier timing closure and lower power consumption.

    Xilinx Virtex-5 and Spartan-3A DSP FPGAs provide the ideal resources to implement high-performance decimation filters for fast ADCs using a technique known as polyphase decomposition. A polyphase decimation filter performs a sampling rate change by allocating the DSP workload among a set of D sub-filters, where D = decimation rate. Each subfilter is only required to sustain a throughput of fs/D, a fraction of the fast incoming sampling rate fs from the ADC.

    Virtex-5 FPGA Techniques for High-Performance Data Converters
    You can harness the DSP resources of Virtex-5 devices to interface to the analog world.

  • As the decimation filter is often the first stage of digital processing, it calls for the highest performance resources closest to the FPGA pins. A Virtex-5 FPGA's input/output block contains an IDDR (input double-data-rate register) driven directly from the FPGA input buffer. Several differential signal standards are supported, including LVDS, which can sustain in excess of 1-Gbps data rates while providing excellent board-level noise immunity.

    The IDDR is used to de-multiplex the fast incoming digital signal from the ADC into two single-data-rate data streams, each at one-half the ADC sampling rate. This is the ideal format to feed a 2x polyphase decimation filter. Using the Virtex-5 DSP48E, each subfilter can sustain 550 MSPS for a maximum 1.1-GSPS ADC sampling rate. Similarly, the Spartan-3A DSP can sustain 500 MSPS ADC sampling rates.
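    To make the polyphase idea concrete, here is a minimal Python/NumPy software model (illustrative only, not the FPGA implementation): a decimate-by-2 FIR is split into two sub-filters, one fed by the even samples and one by the odd samples (exactly the two streams the IDDR produces), and each sub-filter runs at half the incoming rate before the branch outputs are summed.

        import numpy as np

        def polyphase_decimate_by_2(x, h):
            """Decimate-by-2 FIR via polyphase decomposition (software model).

            x: input samples at rate fs; h: prototype FIR taps.
            Sub-filter 0 processes the even samples and sub-filter 1 the odd
            samples (the two IDDR output streams); each runs at fs/2.
            """
            h0, h1 = h[0::2], h[1::2]            # polyphase split of the taps
            x_even, x_odd = x[0::2], x[1::2]     # the two half-rate streams
            y0 = np.convolve(x_even, h0)
            # The odd branch sees its data one input sample later, hence the delay.
            y1 = np.convolve(np.concatenate(([0.0], x_odd)), h1)
            n = min(len(y0), len(y1))
            return y0[:n] + y1[:n]

        # Sanity check against filtering at the full rate, then downsampling.
        rng = np.random.default_rng(0)
        x = rng.standard_normal(1024)
        h = rng.standard_normal(8)               # arbitrary taps for the check
        direct = np.convolve(x, h)[::2]
        poly = polyphase_decimate_by_2(x, h)
        print(np.allclose(direct[:len(poly)], poly[:len(direct)]))   # True

    The equivalence check confirms that splitting the work between two half-rate sub-filters reproduces the full-rate filter-then-decimate result, which is exactly what lets each DSP48E sub-filter run at half the ADC rate.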

    With the benefits of faster ADC sampling rates come the challenges of smaller data-valid windows to latch the data into the FPGA. Furthermore, the wider the ADC data-word precision, the more daunting the layout task and the higher the potential for skew across individual signals of the data bus, resulting in corrupted data. Virtex-5 FPGAs provide a robust solution called IODELAY, a programmable delay element contained in every I/O block. IODELAY can individually time-shift each signal of the data bus to accurately position the data-valid window at the optimal transition of the half-rate data-ready signal (DRY).

    Figure 1 illustrates the unique features in Virtex-5 devices that serve to implement a high-performance ADC interface. To minimize sampling jitter, the ADC forwards the source-synchronous DRY signal along with the data, while the clean, low-jitter sampling clock routes directly to the ADC without passing through the FPGA.

    High-Performance DAC Interface

    For equal data-word precision, digital-to-analog converters (DACs) typically offer higher sampling rates than ADCs, resulting in significant design challenges at the DAC extremity of the signal chain. Several features of the Virtex-5 architecture can help surmount the task. For example, consider the Virtex-5 interface to a Texas Instruments (TI) DAC5682Z 16-bit dual-DAC with a 1-GSPS sampling rate and LVDS inputs.

    In a practical system, the 1-GSPS sampling rate need only be deployed at the final output stage to the DAC, while intermediate stages in the FPGA signal processing chain can work at a slower sampling rate commensurate with the signal bandwidth. This allows a slower system clock in the intermediate processing stages, with the benefits of easier timing closure and lower power consumption.

    As is the case for the ADC, polyphase filters are efficient DSP structures to realize sampling rate changes at the DAC end of the signal chain. To attain a 1-GSPS output sampling rate to the TI DAC5682Z, a 2x polyphase interpolation filter uses two sub-filters, each with a throughput of 500 MSPS. These rates are within the performance specifications of the Virtex-5 DSP48E slice.

    A multiplexer is required to combine the outputs of the sub-filters to attain the fast output rate from the polyphase. For a 1-GSPS output sampling rate, it is advisable to situate the polyphase interpolator multiplexer as close as possible to the LVDS output buffers driving the DAC5682Z. Virtex-5 FPGAs provide a dedicated resource within the I/O block that is ideally suited to this purpose: the ODDR (output double-data-rate) registers. The ODDR routes directly to fast LVDS differential output buffers able to sustain output rates of 1 GSPS (and beyond) while maintaining signal integrity on the PCB.

    Conclusion

    In this article, I've presented DSP and interface techniques for mixed-signal systems using Xilinx Virtex-5 FPGAs. You can optimize system performance by preserving the outstanding SNR specifications of modern high-performance data converters using key features of Virtex-5 devices.

    The techniques described in this article will be featured in Speedway 2007 DSP sessions, in collaboration with major Avnet data converter suppliers Texas Instruments, Analog Devices, and National Semiconductor. For details, visit http://em.avnet.com/xilinxspeedway.


    Figure 1 - High-performance ADC interface: a low-jitter sampling clock drives the ADC directly; the ADC's DATA and source-synchronous DRY signals enter the Virtex-5 I/O block through IODELAY and IDDR, which feed sub-phase 0 and sub-phase 1 of a 2x polyphase decimator built on XtremeDSP slices.

    Figure 2 - DAC: polyphase interpolation + ODDR + LVDS. Sub-phase 0 and sub-phase 1 of a 2x polyphase interpolator (XtremeDSP slices) drive the ODDR and IODELAY in the I/O block, which in turn drive the DAC; a low-jitter sampling clock again goes directly to the converter.


  • by Andreas [email protected]

    Ethernet is playing an increasingly important role today, as it is used everywhere to connect everything. Not surprisingly, bandwidth requirements are increasing in the backbone as well as in end systems.

    The implication of this increasing bandwidth is an increased traffic processing load in end systems. Today, most end systems use one or more CPUs with an OS and a network stack to implement network interface functions. For many applications, the increasing traffic load leads to performance issues in the network stack implementation. As these performance issues are seen already at 1 Gigabit Ethernet (GbE), implementing 10 GbE using a software stack is not a viable solution.

    To solve these problems, IPBlaze has developed a unique and highly configurable TCP/IP offload engine (TOE). The TOE processes the TCP/IP stack at wire speed in hardware instead of using a host CPU and thus reduces its processing burden.

    IPBlaze has implemented the TOE for various end systems. Our measurement results show that using an IPBlaze TOE reduces latencies and CPU utilization considerably. In this article, I'll give an overview of the implementation options available today for Ethernet applications and show where the IPBlaze TOE can give you the upper edge in terms of performance. Figure 1 shows performance examples with and without the IPBlaze TOE.

    Ethernet Configurations

    Table 1 shows the five different implementation options available today.

    A CPU plus a simple network interface card (NIC) is a very general solution used in most PCs and servers today. A new CPU generation always provides higher performance and thus also increases network performance. However, the increase in network processing performance is significantly lower than the increase in CPU power. The same problem exists with embedded CPUs; the performance issues arise at 10 to 100 times lower data rates.

    A high-performance CPU and simple NIC is a flexible solution with a well-known API (socket), and it is easy to implement applications in common PC/server environments. One of the downsides is the difficulty in scaling efficiently to 10 GbE because the load distribution between CPU cores and process intercommunication creates a bottleneck. Power dissipation (heat) is also a limiting factor in many systems.

    Some NIC acceleration ASICs on the market perform protocol offload for specific applications such as storage. A high-performance CPU and NIC acceleration ASIC is a good solution if the ASIC supports all of the features needed (iSCSI, TCP offload). The functionality and bandwidth is fixed, however, which makes it very hard to add functionality or adapt to protocol changes without a huge performance impact.

    Reducing CPU Load for Ethernet Applications
    A TOE makes 10 Gigabit Ethernet possible.

  • FPGAs are so powerful today that it is possible to implement an accelerated 10 GbE NIC in an FPGA at a competitive cost point. One example is the IPBlaze TOE implemented in Xilinx Virtex-4 and Virtex-5 devices. The Virtex-5 LXT device has hardware support for PCIe (PCI Express), which makes it an attractive choice to use in PCs and servers.

    Examples of ASIC SoCs are controller-type ASICs and communication controller ASICs. This is a good solution for multifunction systems with limited bandwidth (up to 10-20 Mbps). However, it is limited by low bandwidth.

    A powerful example of an FPGA SoC solution is an FPGA with an IPBlaze TOE and an embedded CPU (a PowerPC hard-core or MicroBlaze soft-core processor). Applications processing can be done in software by the PowerPC while the FPGA hardware, together with the TOE, handles bulk transfers like video or images at wire speed (1 or 10 GbE). This solution makes it possible to use FPGA logic for fast data processing and fast data communication and software for complex low-bandwidth operations. Software in the embedded CPU can be the Xilinx microkernel, Embedded Linux, or VxWorks.

    The IPBlaze 10 GbE TOE NIC FPGA solution has the same performance as a TOE NIC ASIC while providing much higher flexibility. High-performance TOE NICs are now available using cost-effective FPGAs. FPGAs help you with fast time to market and ensure compatibility and interoperability.

    Application Examples

    The IPBlaze TOE is a general protocol offload engine including TCP offload. The IPBlaze TOE product lines include a number of TOEs with different functionality targeting different applications (Figure 2).

    One example is an FPGA working as an intelligent high-performance TOE NIC at n-times-1 GbE or n-times-10 GbE connected to a PC through the PCIe interface. The NIC can be customized with add-on functions such as advanced switch flow control. You can also easily add protocol processing functions such as RDMA and iSCSI for storage applications.

    Another example would be embedded video applications where Ethernet is used as a backplane. The TOE can transfer the images or video streams without the need for a CPU.

    High Flexibility with FPGAs

    FPGAs hold a number of advantages compared to ASICs and offer scalable solutions with high data rates: 1 GbE, n-times-1 GbE, 10 GbE, and n-times-10 GbE. The flexibility of the FPGA makes it possible to add new functionality and adapt to protocol changes. Advanced network functions like switch-specific flow control or performance-enhancing features can be added. In load distribution, the IPBlaze TOE can distribute traffic to multiple CPU cores for higher system performance. Maintenance and upgrade of the TOE hardware functions located in the FPGA is done by firmware upgrade, much like driver software updates.

    IPBlaze 10 GbE TOE Core

    The IPBlaze 10 GbE TOE core is an implementation of a full-featured TCP/IP networking subsystem in the form of a drop-in silicon IP core. It includes a standards-compliant TCP/IP stack implemented as a compact high-performance synchronous state machine.

    The TOE core is built from a collection of synthesizable Verilog modules, which are customized before delivery to provide the best possible performance and feature set in a given system.

    software by the PowerPC while the FPGAhardware, together with the TOE, handlesbulk transfers like video or images at wirespeed (1 or 10 GbE). This solution makesit possible to use FPGA logic for fast dataprocessing and fast data communicationand software for complex low-bandwidthoperations. Software in the embeddedCPU can be Xilinx microkernel,Embedded Linux, or VxWorks.

    The IPBlaze 10 GbE TOE NIC FPGAsolution has the same performance as aTOE NIC ASIC while providing muchhigher flexibility. High-performanceTOE NICs are now available using cost-effective FPGAs. FPGAs help you withfast time to market and ensure compati-bility and interoperability.


    Table 1: Overview of implementation options

    High Performance                                      Properties
    High-performance CPU + simple NIC                     High CPU load; CPU limits performance
    High-performance CPU + NIC acceleration ASIC          Good within supported feature set, but inflexible solution
    High-performance CPU + NIC acceleration FPGA          Flexible, scalable solution

    Medium Performance                                    Properties
    ASIC SoC (communication controller CPU, network interface)   General solution, performance limited by software
    FPGA SoC (integrated CPU, network acceleration)       Good for high data rates, flexible and scalable

    [Figure residue: bar-chart labels only. The chart "Ethernet Performance Measured by netperf" plots CPU load (%) and throughput (Mb/s, wire-rate TCP packets) for six configurations: IPBlaze TOE, Intel M 1.6, 1500B packets; Intel Dual-Core E6600, 250B packets; Intel Dual-Core E6600, 1500B packets; AMD Athlon 64 3700+, 1500B packets; standard NIC, Intel M 1.6, 1500B packets; IPBlaze TOE, Intel M 1.6, 250B packets.]


    Figure 1 The CPU power needed to process network data and the throughput achieved, tested on 1-GbE links using the netperf performance measurement program


    The TOE core can be instantiated along with your own IP and configured to operate without the need for host CPU support.

    The IPBlaze TOE core can support as many as 1,000 concurrent TCP connections in on-chip memory. When a higher number of connections is required, a connection cache allows you to access the most recently used connection states on-chip. The state information for non-cached TCP connections is stored in host memory or in dedicated external memory.

    For applications where the TOE does not terminate connections but monitors all connections on a backbone, an external dedicated high-speed memory is used to support a very high number (1 million and above) of concurrent TCP connections.

    The IPBlaze TOE core uses three well-defined interfaces:

    Xilinx 1 GbE and 10 GbE MAC network interfaces for either 1 GbE or 10 GbE support

    A high-speed RAM option for dedicated external memory for TCP connection state information

    Xilinx PCIe-type or memory host processor interface

    The host interface is essential for system behavior and performance and represents a significant part of the system complexity.

    The host CPU has access to TOE registers, while the TOE has access to main memory. The main memory contains the send and receive data buffers, queue structures, and TCP connection state information if connection caching is enabled. An event system is used for fast message signaling between the host and the TOE; it includes a number of optimizations for high throughput and low latency.

    OS-Compliant Socket API
    Applications software interfaces to the TOE through a fully standards-compliant socket API, with significantly higher performance than a standard software socket implementation. The socket API is implemented at the CPU kernel level. CPU load is reduced by 5-10 times and latency is improved by 4-5 times.

    Hardware-based acceleration should be able to scale well. The evolution of performance in future Xilinx FPGA generations provides a good upgrade path. Furthermore, continuous improvements in IPBlaze TOE technology also provide increased performance.

    The flexibility of an FPGA solution allows easy upgrades of the FPGA code to support new protocol features.

    The IPBlaze TOE core is designed to target a variety of FPGA families, depending on the performance and functionality requirements. It has been implemented in Virtex-II, Virtex-4, and Virtex-5 devices.

    Here are some example numbers from a 10 GbE TOE implemented in Virtex-5 devices (a rough sanity check on these figures follows the list):

    Core clock: 125 MHz

    TOE latency: 560 ns

    Throughput: 10G line rate (10G bidirectional)

    Packet processing rate: 12 million packets per second
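    As a rough sanity check on these figures (my arithmetic, not additional IPBlaze data):

    125 MHz / 12 million packets per second ≈ 10 clock cycles per packet
    10 Gbps / (1,500 bytes x 8 bits) ≈ 0.83 million packets per second for full-size frames

    so the quoted packet rate leaves ample headroom for much smaller packets, where per-packet overhead dominates.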

    Xilinx and IPBlaze provide TCP offload solutions that can be implemented as is or customized for functionality, size, speed, or target application. IPBlaze also offers a comprehensive set of high-performance TOE solutions for different market segments. Current TOE solutions include:

    IPBlaze General TOE for NIC applications.

    IPBlaze Security TOE for high-performance (2-times-10 GbE) bump-in-the-wire network monitoring. The security TOE supports more than 1 million concurrent TCP connections.

    IPBlaze Embedded TOE for SoC applications that require high-speed network communication (1 GbE and 10 GbE).

    Conclusion
    With standards-compliant TOE cores, IPBlaze has created a technology platform for the 10 GbE TOE market, working with industry leaders in high-performance computing solutions. Programmable solutions enable system architects to add functionality as needed. Integrating multiple IP cores into a single FPGA can reduce board costs and time to market.

    To learn more about protocol offload solutions, visit www.ipblaze.com.


    [Figure residue: block labels only. Figure 2 pairs a high-performance CPU (applications in SW) with a CPU + accelerated NIC built in a Virtex-5 FPGA (PCIe, IPBlaze TOE, n*1 GbE/10 GbE MAC, network interface, n*1 GbE/10 GbE PHY), and shows an FPGA SoC in a Virtex-4 or Virtex-5 FPGA (CPU, IPBlaze TOE, n*1 GbE/10 GbE MAC, network interface, n*1 GbE/10 GbE PHY, applications in HW, e.g., video).]

    Figure 2 Block diagram for a high-speed TOE solution for a PC/server and embedded system


Automated MGT Serial Link Tuning Ensures Design Margins
You can now streamline the serial link tuning process using Agilent's Serial Link Optimizer tool with Xilinx IBERT measurement cores.

by Brad Frieden
Applications Development Engineer
Agilent
[email protected]

    When implementing high-speed serial links in FPGAs, you must consider and address the effects of transmission-line signal integrity. Used together, transmitter pre-emphasis and receiver equalization extend the rate at which high-speed serial data can be transferred by opening up the eye diagram, despite physical channel limitations.

    An internal bit error ratio tester (IBERT) measurement core from Xilinx can observe serial signals at an internal receiver point. Used in conjunction with the Agilent Serial Link Optimizer, you get both a graphical view of the BER across the unit interval and the ability to automatically adjust pre-emphasis and equalization settings to optimize the channel.

    In this article, I'll show you how to optimize a Virtex-4 MGT high-speed serial link through this process and discuss the results.

    Challenges of Signal Degradation
    At 3.125- and 6-Gbps rates and rise times of 130 ps or shorter, it is no wonder that most applications end up with significant signal integrity effects from the physical channel that distort the signal at the receiver input. Distortion can come from multiple reflections caused by impedance discontinuities, but a more fundamental effect, especially in FR4 dielectric PC boards, slows down the edge speeds: frequency-dependent skin effect causes a slow tail on the pulse. To compensate, you can apply a time-domain technique called pre-emphasis to the transmit pulse, with significant improvement at the receiver.

    Additionally, because the channel is bandwidth-limited, you can apply a frequency-domain technique called equalization at the receiver to compensate for channel frequency roll-off. A peaked frequency response yields a flatter overall response when combined with the channel roll-off. The effects of equalization at the receiver input are quite dramatic, but they are only visible inside the chip at the actual receiver input (post-equalization).

    Measuring Link Performance
    It is important to be able to verify the performance of a link and to optimize it through the combined adjustment of transmitter pre-emphasis and receiver equalization. Unfortunately, taking a measurement on the pins of the FPGA where the receiver is located yields a very distorted signal, since you are observing the signal with pre-emphasis applied but without corresponding equalization.




    Figure 1 is an example of such a measurement made with an Agilent digital communications analyzer on a 6-Gbps channel implemented with ASIC technology. Notice that the eye diagram actually appears completely closed, even though good signals are present at the receiver input inside the chip.

    IBERT Measurement Core
    Fortunately, you can observe what is going on at the FPGA's receiver input by using an IBERT measurement core that is part of the Xilinx ChipScope Pro Serial I/O toolkit. The normal FPGA design is temporarily replaced with one that creates stimulus at the transmitter and measures BER at the receiver. These stimulus/response core pairs are placed in the design using a core generation process. Now it is possible to observe BER at the receiver inside the chip. Using that basic measurement capability, you can apply a variety of pre-emphasis and equalization combinations to optimize the channel response.

    Basic BERT Measurements
    To understand link performance, you must have the ability to measure bit error rate. To do this, a new tool called the Agilent Serial Link Optimizer takes control of IBERT core stimulus and response in the serial link to create such a measurement. A measurement system as shown in Figure 2 allows for this kind of test. Here, two Virtex-4 FPGAs comprise the link: one implements the transmitter, the other the receiver. Both FPGAs are under JTAG control from the Serial Link Optimizer software, and measurements are taken at the receiver input point inside the FPGA to measure BER.

    Some of the selectable measurement attributes include:

    Loopback mode (internal, external, or none)

    Test pattern type

    Dwell time at each point across the unit interval

    Manual injection of errors

    I obtained a measurement on a serial link comprising a Xilinx Virtex-4 XC4VFX20 multi-gigabit transceiver (MGT) 113A transmitter on one Xilinx ML405 board, connected through a SATA cable to an MGT 113A receiver in a second Virtex-4 XC4VFX20 FPGA on a second ML405 board. I made a USB JTAG connection to the first FPGA and the IBERT core associated with the transmitter, and a parallel JTAG connection to the second IBERT core associated with the receiver. These cores, along with the topology of the ML405 boards, dictate the physical channel.

    Steps to set up this measurement include (assuming that the IBERT cores are already created and loaded into the FPGAs):

    1. Start tool and configure USB JTAG for TX and verify connection

    2. Configure parallel JTAG for RX and verify connection

    3. Select MGT 113A transmitter on USB cable

    [Figure residue: diagram labels only. Figure 1: transmitter, receiver with equalizer, Agilent DCA at pins, embedded BERT. Figure 2: backplane linking two PC boards, each with an FPGA containing an IBERT core (inserted with the FPGA design software), driven over JTAG through a Xilinx parallel or USB cable from a software application on a Windows PC.]

    Figure 2 Serial Link Optimizer block diagram

    Figure 1 DCA measurement on FPGA serial I/O pins versus on-chip measurement at the receiver input




    4. Select MGT 113A receiver on parallel cable

    5. Select loopback (external)

    6. Select test pattern type (PRBS7)

    7. Set up TX and RX line rates (reference clock 150 MHz, line rate 6 Gbps)

    8. Select BERT tab and press Run

    I made a measurement on the link with zero errors after 1E+12 bits (an approximately 3-minute measurement). This measurement of BER on the channel occurred at the receiver input internal to the chip and required no external measurement hardware. It forms the basis for the additional capabilities in the Serial Link Optimizer tool.
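    For reference, the PRBS7 pattern selected in step 6 is the 127-bit sequence produced by the polynomial x^7 + x^6 + 1. A minimal serial generator looks like the sketch below; this only illustrates the pattern class, it is not the IBERT core's own generator, and the names are mine. A BERT checker simply runs the same LFSR at the receiver and counts mismatches.

// Minimal serial PRBS7 (x^7 + x^6 + 1) generator; illustrative only.
module prbs7_gen (
  input  wire clk,
  input  wire rst,      // synchronous reset loads a non-zero seed
  output wire bit_out   // one pattern bit per clock
);
  reg [6:0] lfsr;

  assign bit_out = lfsr[6];

  always @(posedge clk) begin
    if (rst)
      lfsr <= 7'h7F;                           // any non-zero seed works
    else
      lfsr <= {lfsr[5:0], lfsr[6] ^ lfsr[5]};  // taps at bits 7 and 6
  end
endmodule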

    A Graphical View of BER
    The next step toward understanding link performance is to have the ability to graph BER as a function of the unit interval. To do this, the Agilent Serial Link Optimizer takes control of IBERT core stimulus and response in the serial link to create such a graph.

    The same measurement system is used as with the basic BER test, but now measurements are taken at 32 discrete steps across the unit interval to show where error-free performance is achieved. The resulting graph is shown in the upper BER plot in Figure 3. The Virtex-4 MGT macro has the ability to adjust the sampling position across these 32 discrete points; the Serial Link Optimizer uses the IBERT core in conjunction with sampling position control to make the measurements. The link performance with default MGT pre-emphasis and equalization settings has zero errors for 0.06 (6%) of the unit interval. This is quite narrow, indicating very little margin. The system could benefit from proper link tuning.

    Automated Link Tuning
    Let's extend this measurement process yet again by automatically trying combinations of pre-emphasis and equalization settings until the error-free zone in the unit interval is maximized, resulting in the best link performance for speed and margin. The Serial Link Optimizer does just that. Link configuration options include an internal loopback test inside the I/O ring of a single FPGA; transmit and receive from one FPGA; and a transmit/receive pair test between two different FPGAs. It is also possible to inject signals from an external BERT instrument and monitor signals at the receiver inside the FPGA.

    In our example, let's use the same serial channel between two Virtex-4 FPGAs with the MGT 113A/B TX/RX pair. On the Serial Link Optimizer interface, select the Tuning tab and the Run button, as shown in Figure 3. The error-free zone with default FPGA settings was 0.06 unit interval. After attempting 141 combinations of pre-emphasis and equalization, the error-free zone increased to 0.281 (28%) of the unit interval, also shown in Figure 3. This means significantly better margins in the link design at the 6-Gbps speed.
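    To put those fractions in time units (a quick conversion, not a number reported by the tool): at 6 Gbps the unit interval is 1/6 GHz, about 167 ps, so the error-free window grew from roughly 0.06 x 167 ps ≈ 10 ps to about 0.281 x 167 ps ≈ 47 ps.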

    Exporting Link Design Constraints
    These new pre-emphasis and equalization settings must now be incorporated into a real design that ultimately uses the serial link. Up until this point, I placed a temporary test design into each FPGA to determine the optimal combination of pre-emphasis and equalization. Now that I have determined the optimal combination, the Serial Link Optimizer outputs the corresponding pre-emphasis and equalization settings for the MGTs. The MGT parameters are represented in VHDL, Verilog, and UCF formats and are output when clicking the Export Setting button. You can cut and paste these parameters back into the source design, thus incorporating the optimal link settings when the final design is programmed into the FPGAs.

    For example, the Verilog settings for the first transceiver are:

// Verilog
// RocketIO MGT preemphasis and equalization
defparam MGTx.RXAFEEQ      = 9'b111;
defparam MGTx.RXSELDACFIX  = 5'b1111;
defparam MGTx.RXSELDACTRAN = 5'b1111;

    Required Components
    To take advantage of these measurements, you will need:

    1. A PC with Windows XP installed (SP2 or higher)

    2. Xilinx programming cable(s), parallel and/or USB

    3. Xilinx ChipScope Pro Serial I/O Toolkit (for the IBERT core)

    4. Agilent E5910A Serial Link Optimizer software

    Conclusion
    Through automated adjustments of pre-emphasis and equalization, while simultaneously monitoring BER at the receiver input inside the FPGA, it is now possible to automatically optimize an MGT-based serial link.

    At 3.125-Gbps rates, such optimization is helpful to get good margins. But at 6-Gbps rates, optimization becomes crucial to achieve link performance that ensures reliable data transfer. This process can save significant time in reaching your desired design margins and can likely achieve better results than what is possible through traditional manual tuning.

    For more information, visit www.agilent.com/find/serial_io or www.agilent.com/find/xilinxfpga.

    Figure 3 BER plot before and after automatic adjustment of pre-emphasis and equalization to tune the serial link


A High-Speed Serial Connectivity Solution with Aurora IP
Aurora is a highly scalable protocol for applications requiring point-to-point connectivity.

by Mrinal J. Sarmah
Hardware Design Engineer
Xilinx, Inc.
[email protected]

Hemanth Puttashamaiah
Hardware Design Engineer
Xilinx, Inc.
[email protected]

    With advances in communication technology, you can achieve gigahertz data-transfer rates in serial links without having to make trade-offs in data integrity. The proliferation of serial connectivity can be attributed to its advantages over parallel communication, including:

    Improved system scalability

    More flexible, thinner cabling

    Increased throughput on the line with minimal additional resources

    More deterministic fault isolation

    Predictable and reliable signaling schemes

    Topologies that promise to scale to the needs of the end user

    Exceptional bandwidth per pin with very high skew immunity

    Reduced system costs because of smaller form factors, fewer PCB traces and layers, and lower pin/wire count

    Although it offers real benefits, serial I/O has some negative attributes. Serial interfaces require high-bandwidth management inside the chip, special initialization and monitoring, bonding of lanes in an aggregated channel of multiple lanes, elastic buffers for data alignment, and de-skewing. Also, flow control is complex, and you must maintain the correct balance between high-level features and total chip area.

    Multi-Gigabit Transceivers
    Following the industry-wide migration from parallel to serial interfaces, Xilinx introduced multi-gigabit transceivers (MGTs) to meet bandwidth requirements as high as 6.5 Gbps.

    The common functional blocks in these transceivers are an 8B/10B encoder/decoder, transmit buffer, SERDES, receive buffer, loss-of-sync finite state machine (FSM), comma detection, and channel bonding logic. These transceivers have built-in clock data recovery (CDR) circuits that can perform at gigahertz rates. Built-in phase-locked loops (PLLs) generate the fabric and transceiver clocks.

    Transceivers have several advantages:

    With their self-synchronous timing models, they reduce the number of traces on the board and eliminate clock-to-data skew

    Multiple MGTs can achieve higher bandwidths

    MGTs with point-to-point connections make switched fabric architectures possible

    The elastic buffers available in MGTs provide high skew tolerance for channel-bonded lanes

    MGTs are designed for configurable support of multiple protocols; hence their control is fairly complex. When designing or integrating designs with a high-speed connectivity solution using MGTs, you will have to consider MGT initialization, alignment, channel bonding, idle sequence generation, link management, data delineation, clock skew, clock compensation, error detection, and data striping and destriping. Configuring transceivers for a particular application is challenging, as you are expected to tune more than 200 attributes.

    The Aurora Solution
    The Xilinx Aurora protocol and its associated designs address these challenges by managing the MGTs' control interface.

    Aurora is free, small, scalable, and customizable. With low overhead, Aurora is a protocol-agnostic, lightweight, link-layer protocol that can be implemented in any silicon device or technology.

    With Aurora, you can connect one or more MGTs to form a communication channel. The Aurora protocol defines the structure of data packets and procedures for flow control, data striping, error handling, and initialization to validate MGT links.

    Aurora shrink-wraps MGTs by providing a transparent interface to them, allowing the upper layers of proprietary or industry-standard protocols, such as Ethernet and TCP/IP, to ride on top of it with easy access.

    This easy-to-use, pre-defined protocol needs very little time to integrate with existing user designs. Being lightweight, Aurora does not have an addressing scheme and cannot support switching. It does not define error correction within data payloads.

    Aurora is defined for the physical and data-link layers of the Open Systems Interconnection (OSI) model and can easily integrate into existing networks.

    A Typical Connectivity Scenario
    Figure 1 is an overview of a typical Aurora application. The Aurora interface transfers data to and from user application functions through the user interface. The user interface is not part of the Aurora protocol specification.

    The Aurora protocol engine converts generic, arbitrary-length data from the Xilinx LocalLink user interface to Aurora protocol-defined frames and transfers it across channel partners over one or more high-speed serial links. The number of links between channel partners is configurable and device-dependent.

    Most Xilinx IPs are developed based on the legacy LocalLink interface, so any user interface designed for other LocalLink-based IPs can be plugged directly into Aurora. For more information on the LocalLink interface, visit www.xilinx.com/products/design_resources/conn_central/locallink_member/sp006.pdf.

    You can think of Aurora as a bridge between the LocalLink interface and the MGT.

    Aurora's LocalLink interface is customizable for 2- or 4-byte data. Your selection of a 2- or 4-byte interface should be based on the throughput requirements and the latency involved. A 4-byte design has more latency than a 2-byte design, but offers better throughput and consumes fewer resources than an equivalent 2-byte design.
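    To make the user side concrete, the sketch below shows a toy frame source driving a 2-byte LocalLink-style interface with the conventional active-low handshake. The module and signal names are mine; the exact port list of a generated Aurora core comes from CORE Generator and should be taken from the generated wrapper (ports along the lines of TX_D, TX_SOF_N, TX_EOF_N, TX_SRC_RDY_N, and TX_DST_RDY_N).

// Toy LocalLink-style frame source: emits back-to-back 8-word frames
// of counter data. Illustrative only; hook the outputs to the
// corresponding TX ports of a generated Aurora core.
module ll_frame_source (
  input  wire        user_clk,
  input  wire        reset,
  input  wire        dst_rdy_n,     // from Aurora TX: low = ready
  output wire [15:0] tx_d,          // 2-byte LocalLink data
  output wire        tx_src_rdy_n,  // low = data valid
  output wire        tx_sof_n,      // low = start of frame
  output wire        tx_eof_n       // low = end of frame
);
  reg [2:0] word_cnt;               // word 0..7 within the frame

  assign tx_d         = {13'd0, word_cnt};
  assign tx_src_rdy_n = 1'b0;                  // always have data to send
  assign tx_sof_n     = ~(word_cnt == 3'd0);
  assign tx_eof_n     = ~(word_cnt == 3'd7);

  always @(posedge user_clk) begin
    if (reset)
      word_cnt <= 3'd0;
    else if (!dst_rdy_n)                       // advance only when accepted
      word_cnt <= word_cnt + 3'd1;             // wraps naturally at 8
  end
endmodule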

    Aurora channels can function as unidirectional (simplex) or bidirectional (full-duplex) links. In full-duplex mode, module initialization data comes from the channel partner through a reverse path, whereas in simplex mode initialization happens through four sideband signals. Aurora is available in simplex TX-only, RX-only, or both. "Both" operation is like full duplex, except that the transmit and receive sections communicate independently. You can use Aurora simplex in token-ring fashion; Figure 2 shows the ring structure of Aurora-S (simplex).

    Data Flow in Aurora
    Data is transferred between Aurora channel partners as frames. Data flow primarily comprises the transfer of protocol data units (PDUs) between the user application and the Aurora interface, as well as the transfer of channel PDUs between channel partners.

    Soon after power-up, when the core comes out of all reset states, the transmitter sends initialization sequences. If the link is fine and the link partner recognizes these patterns, it sends acknowledgement patterns. After receiving sufficient acknowledgement patterns, the transmitter transitions to a link-up state, indicating that the individual transceiver connection between channel partners is established. Once the link is up, the Aurora protocol engine moves to a channel verification stage for a single-lane channel, or to a channel-bonding stage (preceding a verification stage) for a multi-lane channel. When channel verification is complete, Aurora asserts a channel-up signal, which is followed by actual data transmission. The link initialization process is shown in Figure 3.
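    As a rough illustration of that bring-up sequence (a simplified sketch in the spirit of Figure 3, not the actual Aurora lane and channel FSMs; the state and signal names are mine, and a single-lane channel would simply tie bonding_done high):

// Greatly simplified link bring-up sequencer. Illustrative only.
module link_bringup (
  input  wire clk,
  input  wire rst,
  input  wire patterns_acked,   // enough acknowledgement patterns received
  input  wire bonding_done,     // channel bonding complete (multi-lane)
  input  wire verify_done,      // channel verification complete
  output reg  lane_up,
  output reg  channel_up
);
  localparam INIT = 2'd0, BOND = 2'd1, VERIFY = 2'd2, UP = 2'd3;
  reg [1:0] state;

  always @(posedge clk) begin
    if (rst) begin
      state      <= INIT;
      lane_up    <= 1'b0;
      channel_up <= 1'b0;
    end else begin
      case (state)
        INIT:    if (patterns_acked) begin lane_up    <= 1'b1; state <= BOND;   end
        BOND:    if (bonding_done)                             state <= VERIFY;
        VERIFY:  if (verify_done)    begin channel_up <= 1'b1; state <= UP;     end
        default: state <= state;     // UP: normal data transmission
      endcase
    end
  end
endmodule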


    [Figure residue: diagram labels only. Figure 1: user applications and LocalLink user interfaces on each side, Aurora blocks with one MGT per lane, lanes 1..N forming a full-duplex channel between Aurora channel partners across an interconnect (e.g., a backplane) carrying user frames, flow control messages, and Aurora frames. Figure 2: FPGAs chained RX-to-TX in a token ring with sideband status signals.]

    Figure 1 Connectivity scenario using Aurora

    Figure 2 Simplex token ring structure


    Frame Types
    As stated previously, data is sent over Aurora channels as frames. The frame types, listed in order of their priority, are:

    Clock compensation (CC)

    Initialization sequences

    Native flow control (NFC)

    User flow control (UFC)

    Channel PDUs

    Idles

    Aurora allows channel partners to use separate reference clocks by inserting CC sequences at regular intervals. It can accommodate differential clock rates of up to 200 parts per million (ppm) between the transmitter and the receiver.
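    As a point of reference (my arithmetic, not a figure from the Aurora specification): a 200-ppm frequency difference means the two clocks drift apart by one clock period roughly every 1 / 0.0002 = 5,000 periods, which sets the scale for how often CC sequences must appear so the receiver's elastic buffer can re-center without overflowing or underflowing.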

    An Aurora frame sends data in two-byte quantities called symbols. If the number of bytes in the PDU is odd, Aurora appends one extra byte called a pad. On the receive side, the pad is removed by the receive logic and the data is presented to the LocalLink interface.

    The transmit LocalLink frame is encapsulated in the Aurora frame, as shown in Figure 4. Aurora encapsulates frames with start-channel PDU (SCP) and end-channel PDU (ECP) symbols. On the receive side, the reverse process occurs. The transceivers are responsible for encoding and decoding the frame into 8B/10B characters, serialization, and de-serialization.
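    As a quick worked example (my arithmetic, following the framing shown in Figure 4): a 7-byte user PDU receives one pad byte to make 8 bytes, or four symbols; adding the two-octet SCP and two-octet ECP gives 12 octets, which 8B/10B encoding expands to 120 bits on the wire.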

    Flow Control
    Aurora supports an optional flow control mechanism to prevent data loss caused by different source and sink rates between channel partners (Figure 5).

    Native Flow Control
    NFC is meant for data-link-layer rate control. NFC operation is governed by two NFC FSMs: RX and TX.

    The RX NFC FSM monitors the state of the RX FIFO. When there is a risk of overflow, it generates NFC PDUs requesting that the channel partner pause transmission of user PDUs for a specific time duration. The TX NFC state machine responds by waiting for the requested time, allowing the RX FIFO to recover from its near-overflow state.

    When sending NFC requests, the round-trip delay of the link must be taken into account; ideally, NFC requests are sent before the receive FIFO overflows to absorb this delay. You can program the NFC pause value from 0 to 256, and the maximum pause value is infinite. An NFC pause value is non-cumulative; new NFC request values override the old.

    There are two NFC request types: immediate mode and completion mode. In immediate mode, an NFC request is processed immediately while lower-priority requests are halted. In completion mode, the current frame is transmitted; only after completion of the transfer is the NFC request processed.
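    A minimal sketch of the TX-side pause behavior described above (illustrative only; the signal names are mine and the infinite-pause case is omitted):

// TX-side NFC-style pause timer: a new request overrides any pause in
// progress (non-cumulative), and user PDUs are held off while the
// counter is non-zero. Not the Aurora core's implementation.
module nfc_pause_timer (
  input  wire       clk,
  input  wire       rst,
  input  wire       nfc_req,     // an NFC PDU arrived from the partner
  input  wire [8:0] nfc_pause,   // requested pause, 0 to 256 symbol times
  output wire       tx_pause     // high: hold off user PDUs
);
  reg [8:0] count;

  always @(posedge clk) begin
    if (rst)
      count <= 9'd0;
    else if (nfc_req)
      count <= nfc_pause;        // override; never accumulate
    else if (count != 9'd0)
      count <= count - 9'd1;
  end

  assign tx_pause = (count != 9'd0) | nfc_req;
endmodule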

    User Flow Control
    UFC is used to implement user-defined flow control at any layer. A UFC message can carry 2 to 16 bytes in a frame. UFC messages are generated and interpreted by the user application.

    Applications
    Aurora is a simple, scalable, open protocol for serial I/O. It can be implemented in FPGAs or ASICs. Aurora can be applied wherever an inexpensive, high-performance link layer with little resource consumption is required, saving many hours of work for users who were considering developing their own MGT protocols.


    [Figure residue: diagram labels only. Figure 3: start, align SERDES, align polarity, polarity corrected, partner handshake, channel bonding (multi-lane), lane verification, operation. Figure 4: user PDU (N octets) plus pad (one octet), SCP and ECP (two octets each) added by link-layer encapsulation, then 8B/10B encoding into symbols.]

    Figure 3 Data flow in Aurora

    Figure 4 Frame encapsulation in transmit Aurora




    Aurora has a variety of applications, including:

    Video

    Medical

    Backplane

    Bridging

    Chip-to-chip and board-to-board communication

    Splitting functionality among several FPGAs

    Simplex communication

    Aurora provides more throughput for fewer resources. Plus, it can adapt to customer needs, allowing additional functions on top. Aurora simplex operation offers the most flexibility and the least resource use where users require high throughput in one direction. Simplex is ideal for one-way communication like streaming video.

    Performance Statistics
    Aurora is designed to use minimal FPGA resources. A single-lane, 2-byte framing design consumes approximately 383 look-up tables and 374 flip-flops. The curves in Figure 6 show the resource utilization statistics for different lane configurations on a Virtex-5 LXT device.

    The latency for a single-lane, 2-byte design is 10 cycles. Transceivers insert some inherent latency; thus, the overall latency is approximately 29 cycles for an Aurora design on a RocketIO GTP transceiver.

    You have the freedom to choose reference clocks from a range of values. The overall throughput depends on the reference clock value and the line rate chosen.
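    For example (my arithmetic, assuming a 2-byte lane running at a 3.125-Gbps line rate): the user clock is 3.125 Gbps / 20 bits per two-byte word = 156.25 MHz, 8B/10B coding leaves 3.125 Gbps x 8/10 = 2.5 Gbps of raw payload bandwidth per lane before framing overhead, and at that clock the quoted 29-cycle latency corresponds to roughly 29 / 156.25 MHz ≈ 186 ns.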

    Aurora: CORE Generator IP
    Aurora is released as a part of CORE Generator software, with about 10 configurable parameters. You can configure an Aurora core by selecting a streaming or framing interface, simplex or full-duplex data flow, single or multiple MGTs, the reference clock value, the line rate, MGT location(s) based on the number of MGTs selected, and the reference clock source. The supported line rate can range from 0.1 Gbps to 6.5 Gbps depending on the Virtex device selected. The core comes with simulation scripts supporting the MTI, NC-SIM, and VCS simulators, and build scripts to ease design synthesis and bit generation.

    Conclusion
    Aurora is easy to use. The IP is efficient, scalable, and versatile enough to transport legacy protocols and implement proprietary protocols. Aurora is also backward-compatible: an Aurora design in a Virtex-5 device can talk to an Aurora design in a Virtex-4 device, making the IP independent of the underlying MGT technologies.