
White Paper: Spartan-6 FPGAs

WP363 (v1.0) April 19, 2010

Spartan-6 FPGA Connectivity Targeted Reference Design Performance

By: Sunita Jain

This white paper discusses the observed performance of the Spartan®-6 FPGA Connectivity targeted reference design. The design uses PCI Express®, Ethernet, and an integrated memory controller along with a packet DMA for moving data between system memory and the FPGA.

The design demonstrates these applications:

• Network interface card (movement of Ethernet frames between system memory and Tri-Mode Ethernet MAC (TEMAC))

• System memory to external memory (DDR3 SDRAM) data transfer

The goal of the targeted reference design is to achieve maximum performance on the PCIe® and Ethernet links.

© 2010 Xilinx, Inc. XILINX, the Xilinx logo, Virtex, Spartan, ISE, and other designated brands included herein are trademarks of Xilinx in the United States and other countries. PCI, PCIe, and PCI Express are trademarks of PCI-SIG and used under license. All other trademarks are the property of their respective owners.

About the Targeted Reference Design

The Spartan-6 FPGA Connectivity targeted reference design demonstrates the key integrated components in a Spartan-6 FPGA, namely, the endpoint block for PCI Express, the GTP transceivers, and the memory controller. The targeted reference design integrates these components in an application along with additional IP cores, including a third-party packet direct memory access (DMA) engine (from Northwest Logic), the Xilinx® Platform Studio LocalLink Tri-Mode Ethernet MAC (XPS-LL-TEMAC), and a Xilinx memory controller block (MCB) delivered through the Xilinx memory interface generator (MIG) tool. A block diagram of the targeted reference design is shown in Figure 1.

Figure 1: Design Block Diagram (host software drivers and GUI, packet DMA, x1 Endpoint Block for PCI Express v1.1, XPS-LL-TEMAC, memory controller block with MIG wrapper, and GTP transceivers on the SP605 board)

This design is a v1.1-compliant x1 Endpoint block for PCI Express showcasing these independent applications:

• A network interface card (NIC) that enables movement of Ethernet packets between the TCP/IP stack running on the host and the external network via the FPGA. The NIC provides two modes of operation:

• GMII mode using external Ethernet PHY (typically used to connect to copper Ethernet networks)


• 1000BASE-X mode using GTP transceivers (typically used to connect to optical fiber Ethernet networks)

• An external memory interface over PCI Express that enables data movement between host memory and external DDR3 SDRAM via the FPGA.

Fully independent transmit and receive path implementations coupled with prefetching of buffer descriptors by the packet DMA help maximize performance. For more details on the targeted reference design, refer to the Spartan-6 FPGA Connectivity Targeted Reference Design User Guide [Ref 1].

A brief description of the data flow in the design is provided here for both network and block data paths.

Network Data Flow

Ethernet packets prepared by the TCP/IP stack are queued up for direct memory access (DMA) by the software driver. The packets are transferred to the FPGA over PCIe and then routed to the Ethernet MAC for transmission on the network. Packets received from the external network by the Ethernet MAC are moved into system memory through PCIe, and the software driver passes them on to the TCP/IP stack.

Block Data Flow

Data to external memory is written in blocks. Software generates the data and queues it up for DMA transfer. The block data is then written to external memory, read back from external memory, and sent back to system memory through PCIe.

Test Setup

The test setup uses the Spartan-6 FPGA Connectivity Kit as the hardware and software platform, providing the SP605 board and the targeted reference design as well as the Fedora 10 LiveCD OS.

The systems used for the performance measurement have these specifications:

• Intel x58 2.67 GHz motherboard, 3 GB RAM, 256 KB L2 cache, 8,192 KB L3 cache, hyper-threaded eight-CPU server machine running Fedora 10 LiveCD 2.6.27.5. (MPS = 256B and MRRS = 512B for the PCIe slot used.)

• ASUS computer with single hyper-threaded 1 GHz Intel Pentium 4 processor, 1 MB cache, two-CPU desktop machine running Fedora 10 LiveCD 2.6.27.5. (MPS = 128B and MRRS = 512B for the PCIe slot used.)

• Dell machine with 1.80 GHz Intel Pentium dual CPU running Fedora Core 6 2.6.18. (MPS = 128B, MRRS = 512B for the PCIe slot used.)

Performance of PCI Express

PCI Express is a serialized, high-bandwidth, scalable point-to-point protocol that provides highly reliable data transfer operations. The maximum transfer rate of a v1.1-compliant PCIe core is 2.5 Gb/s. This rate is the raw bit rate per lane per direction and not the actual data transfer rate. The effective data transfer rate is lower due to protocol overhead and other system design trade-offs [Ref 2].

Theoretical PCI Express Performance Estimate

The maximum line rate supported by the PCIe protocol specification v1.1 is 2.5 Gb/s. This effectively becomes 2 Gb/s after accounting for 8B/10B encoding overhead.


The following contribute to the protocol overhead associated with payload transfer:

• The transaction layer adds 12 bytes of header (considering only 32-bit addressable transactions without the optional end-to-end cyclic redundancy check (ECRC)).

• The data link layer adds 2 bytes of sequence number and 4 bytes of link layer cyclic redundancy check (LCRC).

• The physical layer adds 1 byte of start-framing and 1 byte of end-framing information.

The protocol overhead is illustrated in Figure 2. For more information, refer to Understanding Performance of PCI Express Systems [Ref 2].

Figure 2: PCI Express Protocol Packet Overhead (PHY: 1-byte Start; data link layer: 2-byte Sequence; transaction layer: 12–16-byte Header, 0–4,096-byte Payload, 4-byte ECRC; data link layer: 4-byte LCRC; PHY: 1-byte End)

The PCIe protocol also uses additional data link layer packets for flow control updates and ACK/NAK to help in link management. Neither is accounted for in this performance estimation.

Note: The following estimation does not represent achievable bandwidth, because it assumes that all lanes are transmitting transaction layer packets (TLPs) at all times without any gaps or physical or data link layer overhead. Therefore, it only provides an indication of the upper performance boundary.

Equation 1 and Equation 2 define the throughput estimate and efficiency, respectively:

Throughput Estimate (Gb/s) = [Payload / (Payload + 12 + 6 + 2)] × 2        Equation 1

Efficiency (%) = [Payload / (Payload + 12 + 6 + 2)] × 100        Equation 2

An estimate of PCIe performance is shown in Table 1 with values obtained using Equation 1 and Equation 2 for various payload sizes.

Table 1: PCIe Performance (Theoretical Estimate)

Payload Size (Bytes)    Throughput (Gb/s)    Efficiency (%)
64                      1.52                 76.19
128                     1.72                 86.48
256                     1.85                 92.75
512                     1.92                 96.24
1024                    1.96                 98.08
2048                    1.98                 99.03

Figure 3 shows that protocol overhead prevents the throughput from reaching the theoretical maximum. Beyond a payload size of 512 bytes, there is no significant gain in performance.

Figure 3: Theoretical PCI Express Throughput (throughput in Gb/s vs. payload size in bytes, approaching the 2 Gb/s theoretical maximum)
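
As a worked illustration of Equation 1 and Equation 2, the short C sketch below reproduces the Table 1 values. It is not part of the reference design; it simply evaluates the 20 bytes of per-TLP overhead against the 2 Gb/s post-encoding line rate described above.

#include <stdio.h>

/* Per-TLP overhead from the bullets above: 12-byte header + 6 bytes of
 * data link layer (sequence + LCRC) + 2 bytes of framing, as in Equation 1. */
#define TLP_OVERHEAD_BYTES (12 + 6 + 2)
#define EFFECTIVE_LINE_RATE_GBPS 2.0   /* 2.5 Gb/s x1 link after 8B/10B encoding */

int main(void)
{
    const int payload_sizes[] = { 64, 128, 256, 512, 1024, 2048 };
    const int n = sizeof(payload_sizes) / sizeof(payload_sizes[0]);

    printf("Payload (Bytes)  Throughput (Gb/s)  Efficiency (%%)\n");
    for (int i = 0; i < n; i++) {
        double payload = payload_sizes[i];
        double efficiency = payload / (payload + TLP_OVERHEAD_BYTES);   /* Equation 2 */
        double throughput = efficiency * EFFECTIVE_LINE_RATE_GBPS;      /* Equation 1 */
        printf("%15d  %17.2f  %14.2f\n",
               payload_sizes[i], throughput, efficiency * 100.0);
    }
    return 0;
}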


Additionally, system parameters like maximum payload size, maximum read request size, and read completion boundary impact PCIe performance. For a detailed understanding of various factors impacting the PCIe link performance, refer to Understanding Performance of PCI Express Systems [Ref 2].

PCI Express Performance Results

This section presents the PCI Express performance results as seen on the memory path using the block data transfer to external memory. The Ethernet application (network path) provided in the design supports a maximum throughput of 1 Gb/s. Thus, block data transfer is used to maximize PCIe performance, which is measured against these metrics:

• DMA throughput: This includes a count of actual payload bytes available at the DMA user interface.

• PCIe transmit utilization: This includes utilization of the TRN-TX interface.

• PCIe receive utilization: This includes utilization of the TRN-RX interface.

The PCIe transmit and receive transaction utilization metrics include overhead due to DMA buffer descriptor management and register setup, in addition to data payload throughput. These metrics are the total of all PCIe transaction utilizations necessary to achieve the observed DMA throughput.

Counters in hardware are used to gather the above metrics. The DMA throughput estimate is provided by the counters present inside the DMA. For PCIe transaction utilization measurement, counters snooping over the TRN-TX and TRN-RX interfaces are implemented. These counters map to user space registers, which are periodically read by the software driver to get a real-time performance estimate.
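
The sketch below illustrates one way such counters could be sampled periodically from a Linux host, assuming they are exposed through a memory-mapped PCIe BAR. The register offsets, the PCI device path, and the use of a user-space mmap() are assumptions for illustration only; the actual driver defines its own register map and reads the counters from kernel space.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical register offsets -- the real design defines its own map. */
#define REG_DMA_TX_BYTE_COUNT 0x0000u
#define REG_DMA_RX_BYTE_COUNT 0x0004u
#define SAMPLE_PERIOD_SECONDS 1

int main(void)
{
    /* Placeholder PCI device path; substitute the endpoint's actual BDF. */
    int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) { perror("mmap"); return 1; }

    uint32_t prev_tx = regs[REG_DMA_TX_BYTE_COUNT / 4];
    uint32_t prev_rx = regs[REG_DMA_RX_BYTE_COUNT / 4];
    for (;;) {
        sleep(SAMPLE_PERIOD_SECONDS);
        uint32_t tx = regs[REG_DMA_TX_BYTE_COUNT / 4];
        uint32_t rx = regs[REG_DMA_RX_BYTE_COUNT / 4];
        /* Byte deltas per sample period converted to Gb/s. */
        printf("TX %.3f Gb/s, RX %.3f Gb/s\n",
               (tx - prev_tx) * 8.0 / 1e9, (rx - prev_rx) * 8.0 / 1e9);
        prev_tx = tx;
        prev_rx = rx;
    }
}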

Performance Variation with Maximum Payload Size

Figure 4 shows performance variation with change in maximum payload size (MPS) on a PCIe link for a fixed block size of 1,024 bytes. MPS is one of the many factors impacting performance.

Figure 4: Performance Variation with MPS (DMA throughput, PCIe transmit utilization, and PCIe receive utilization in Gb/s for MPS = 128 bytes and MPS = 256 bytes)



From Figure 4, it can be seen that:

• Higher MPS improves performance, because more data can be sent when compared to TLP header overhead for a given transaction.

• The DMA throughput plotted is similar for both transmit and receive directions. The PCIe transaction utilization includes DMA setup overhead and register read/write operations. Thus, the throughput observed is higher than the actual payload throughput on DMA.

• PCIe receive utilization (downstream traffic due to reads) is slightly higher than PCIe transmit utilization (upstream traffic due to writes) due to the fact that an additional descriptor fetch of 32 bytes is counted in the PCIe receive direction as compared to a 4-byte or 12-byte descriptor update in the PCIe transmit direction.

The memory path loops back data; that is, it reads data from system memory, writes it to external memory, and then reads data back from external memory and writes it to system memory. This essentially indicates that the performance is controlled by the PCIe read transactions, because they are slower than the write transactions. The data available through reads is written back to the system memory. Because the data for this test is generated by the software driver itself, care is taken to generate data buffer addresses aligned to system page boundaries to optimize performance.

Performance Variation with Data Block Size

As data block size increases, throughput improves. This can be seen in Figure 5. The results shown are obtained with MPS = 256 bytes.

The test is set up so that one data block is defined by one descriptor to reflect the true impact of data block size. If one descriptor had defined a fixed block size (e.g., 256 bytes), then fetching a 256-byte data block would result in a fetch of one descriptor (the result of just one read), while fetching a 1,024-byte data block would result in a fetch of four descriptors (the result of four reads). The performance would, in the latter case, be characteristic of an aggregate 256-byte data block size rather than a true 1,024-byte block size.
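
The descriptor-fetch arithmetic behind this test setup can be illustrated with a few lines of C. The 256-byte granularity here is the hypothetical fixed descriptor size discussed above, not a parameter of the actual DMA.

#include <stdio.h>

/* Number of descriptor fetches needed to describe one data block when each
 * descriptor covers at most 'bytes_per_descriptor' bytes (rounding up). */
unsigned descriptors_per_block(unsigned block_bytes, unsigned bytes_per_descriptor)
{
    return (block_bytes + bytes_per_descriptor - 1) / bytes_per_descriptor;
}

int main(void)
{
    const unsigned blocks[] = { 256, 512, 1024, 2048, 4096 };
    for (unsigned i = 0; i < sizeof(blocks) / sizeof(blocks[0]); i++) {
        printf("%4u-byte block: %u fetch(es) with fixed 256-byte descriptors, "
               "1 fetch with one descriptor per block\n",
               blocks[i], descriptors_per_block(blocks[i], 256));
    }
    return 0;
}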



The PCIe utilization plots in Figure 5 also indicate additional overhead required to set up data transfers using DMA. The DMA throughput measures the actual payload transferred between system memory and hardware. The PCIe utilization in the receive direction is more than in the transmit direction, as can be seen in Figure 5. This is because an additional descriptor fetch of 32 bytes is counted in the PCIe receive direction, as compared to a 4-byte or 12-byte descriptor update in the PCIe transmit direction. This per-packet difference is seen more for smaller block sizes than for larger block sizes.

Figure 5: Throughput Variation with Data Block Size (DMA transmit/receive throughput and PCIe transmit/receive utilization in Gb/s vs. data block size in bytes, MPS = 256 bytes)

Ethernet Performance

The raw line rate of the Ethernet link is 1.25 Gb/s. This is commonly referred to as Gigabit Ethernet after accounting for encoding overhead. The 1000BASE-X link uses 8B/10B encoding, while the 1000BASE-T link uses 4D PAM-5 encoding. The actual data rate thus comes out to be 1,000 Mb/s [Ref 3].

Theoretical Ethernet Performance Estimate

The performance of various Ethernet applications at different layers is less than the throughput seen at the software driver and the Ethernet interface. This is due to the various headers and trailers inserted in each packet by the layers of the networking stack. Ethernet is only a medium to carry traffic. Various protocols, such as transmission control protocol (TCP) and user datagram protocol (UDP), implement their own header/trailer formats.

To estimate theoretical performance, consider raw Ethernet traffic not accounting for network protocol overheads, and traffic accounting for TCP. The protocol header includes:

• TCP/IP Overhead: 20-byte TCP header + 20-byte IP header

• Ethernet Overhead: 14-byte Ethernet header + 4-byte trailer (as a frame check sequence)

Additionally, there are 8 bytes of preamble with a start-of-frame delimiter and 12 bytes of interframe gap on the wire.


Based on the above information, the throughput calculations can be formulated as shown in Equation 3 and Equation 4, where Payload is the Ethernet frame payload.

Raw Throughput (Mb/s) = [Payload / (Payload + 18 + 20)] × 1,000        Equation 3

TCP/IP Throughput (Mb/s) = [Payload / (Payload + 18 + 20 + 40)] × 1,000        Equation 4

Table 2 shows the theoretical raw Ethernet throughput (without network protocol overhead) and TCP/IP throughput, which does account for network protocol overhead [Ref 4].

Note: This computation is purely for Ethernet and does not account for the PCIe protocol or DMA management overhead.

A plot of the theoretical throughput in Figure 6 shows that protocol overhead prevents the gigabit Ethernet NIC from even theoretically approaching the wire speeds. The protocol overhead negatively impacts the throughput for payload sizes less than 512 bytes.

Table 2: Theoretical Ethernet Throughput Calculation

Payload Size (Bytes)    Raw Throughput (Mb/s)    TCP/IP Throughput (Mb/s)
46                      547.61                   370.96
64                      627.45                   450.70
128                     771.08                   621.35
256                     870.74                   766.46
512                     930.90                   867.79
1,024                   964.21                   929.21
1,442                   974.32                   948.68
1,472                   974.83                   949.67
1,496                   975.22                   950.44

Figure 6: Raw Traffic vs. TCP/IP Traffic Theoretical Throughput (throughput in Mb/s vs. payload size in bytes)
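
The following C sketch evaluates Equation 3 and Equation 4 and reproduces the Table 2 values to within rounding. It is provided only as a worked example of the overhead arithmetic, not as part of the reference design.

#include <stdio.h>

#define ETH_OVERHEAD_BYTES   18   /* 14-byte header + 4-byte FCS */
#define WIRE_OVERHEAD_BYTES  20   /* 8-byte preamble/SFD + 12-byte interframe gap */
#define TCPIP_HEADER_BYTES   40   /* 20-byte TCP header + 20-byte IP header */
#define LINE_RATE_MBPS     1000.0

int main(void)
{
    const int payloads[] = { 46, 64, 128, 256, 512, 1024, 1442, 1472, 1496 };
    printf("Payload (Bytes)  Raw (Mb/s)  TCP/IP (Mb/s)\n");
    for (unsigned i = 0; i < sizeof(payloads) / sizeof(payloads[0]); i++) {
        double p = payloads[i];
        /* Equation 3: payload vs. everything else that occupies the wire. */
        double raw = p / (p + ETH_OVERHEAD_BYTES + WIRE_OVERHEAD_BYTES) * LINE_RATE_MBPS;
        /* Equation 4: TCP/IP headers also displace payload inside the frame. */
        double tcp = p / (p + ETH_OVERHEAD_BYTES + WIRE_OVERHEAD_BYTES + TCPIP_HEADER_BYTES)
                     * LINE_RATE_MBPS;
        printf("%15d  %10.2f  %13.2f\n", payloads[i], raw, tcp);
    }
    return 0;
}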


Ethernet Performance Results

This section describes the Ethernet performance results as observed with Netperf.

Test Methodology

Netperf v2.4 was used as the testbench for measuring Ethernet performance [Ref 5]. Netperf is a data transfer application running on top of the TCP/IP stack, and it operates around a basic client-server model. The client could be configured for different message sizes and to open a TCP or UDP connection. After establishing a connection, either the client or server (depending on the direction in which performance is being measured) transmitted a stream of full-size 1,514-byte packets. Only ACK packets were sent in the opposite direction.

For performance measurement, a socket size of 16 KB was used. The two systems were connected in a private LAN to avoid other external LAN traffic.

For Ethernet performance measurement, the setup consisted of two PC machines connected in a private LAN environment, as shown in Figure 7. The Spartan-6 FPGA Connectivity Kit’s NIC was hosted on an ASUS desktop machine.

Figure 7: Ethernet Performance Measurement Setup (a PC with an SP605 NIC connected over a private LAN to a standard PC with a commercial NIC)

To measure the CPU utilization as accurately as possible, no other applications (other than the standard ones) were run during the test. The measurements shown were obtained with the Netperf TCP Stream test. This test measures unidirectional transmission of TCP data and does not include connection establishment time. This measurement indicates performance for processing outbound packets, i.e., from the Netperf client running on the host PC towards the netserver running on a remote PC.

Throughput Analysis

Throughput was measured while varying the message size in Netperf. The maximum transmission unit (MTU) was kept at 1,500 bytes, and a socket size of 16,384 bytes was used.

The message size is the send buffer size, which indirectly impacts the size of the packets sent over the network. MTU size decides the maximum size of a frame that can be sent on the wire. Figure 8 depicts the Ethernet throughput measured for outbound traffic.
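
For readers unfamiliar with how these Netperf parameters map onto socket calls, the minimal sketch below streams fixed-size messages over a TCP socket with a 16 KB send buffer, which is essentially what the TCP Stream test does. The remote address, port, message size, and total byte count are placeholders; this is not Netperf itself.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define MESSAGE_SIZE 16384   /* example message size (Netperf "-m") */
#define SOCKET_BUF   16384   /* 16 KB socket buffer, as used in these tests */

int main(void)
{
    char buf[MESSAGE_SIZE];
    memset(buf, 0xA5, sizeof(buf));

    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0) { perror("socket"); return 1; }
    int sz = SOCKET_BUF;
    setsockopt(s, SOL_SOCKET, SO_SNDBUF, &sz, sizeof(sz));

    struct sockaddr_in peer = { 0 };
    peer.sin_family = AF_INET;
    peer.sin_port = htons(5001);                        /* placeholder port */
    inet_pton(AF_INET, "192.168.1.2", &peer.sin_addr);  /* placeholder remote PC */
    if (connect(s, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        perror("connect");
        return 1;
    }

    /* Stream MESSAGE_SIZE-byte writes; the kernel segments them into
     * MTU-sized frames (1,514 bytes on the wire for a 1,500-byte MTU). */
    for (long sent = 0; sent < (1L << 30); sent += MESSAGE_SIZE)
        if (send(s, buf, sizeof(buf), 0) < 0) { perror("send"); break; }

    close(s);
    return 0;
}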



It can be deduced from Figure 8 that when the message size is 256 bytes or less, the TCP/IP protocol overhead has a larger impact and thus reduces the throughput. With larger message sizes, the protocol overhead decreases, thereby increasing the throughput. Beyond a message size of 1,024 bytes, throughput stabilizes at a value of around 936 Mb/s. The result also demonstrates that performance scales with variation in packet size.

Frames greater than 1,514 bytes are categorized as “jumbo” frames. The impact of jumbo frames on throughput with increasing MTU size is shown in Figure 9. The results are reported for a message size of 8,192 bytes.

Thus, increasing MTU size improves performance because larger size packets can be sent. This means that a packet can have more payload for the same header. Using a large MTU size allows the operating system to send fewer packets of a larger size to reach the same data throughput.

Note: For higher MTU sizes, the highest performance seen is greater than 980 Mb/s, compared to approximately 936 Mb/s, as seen with an MTU of 1,500 bytes.
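
One way to raise the MTU on a Linux host is the SIOCSIFMTU ioctl, sketched below; standard networking tools accomplish the same change. The interface name and the 8,192-byte value are placeholders, and the link partner must also support the larger frame size.

#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);   /* any socket works for the ioctl */
    if (fd < 0) { perror("socket"); return 1; }

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   /* placeholder interface name */
    ifr.ifr_mtu = 8192;                            /* jumbo MTU, as in Figure 9 */

    /* SIOCSIFMTU changes the interface MTU; requires CAP_NET_ADMIN (root). */
    if (ioctl(fd, SIOCSIFMTU, &ifr) < 0) {
        perror("SIOCSIFMTU");
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}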

Figure 8: Outbound Ethernet Throughput (throughput in Mb/s vs. Netperf message size in bytes, rising from roughly 457 Mb/s at small message sizes to about 936 Mb/s beyond 1,024 bytes)

Figure 9: Frame Throughput (Outbound) (throughput in Mb/s vs. MTU size in bytes: 936.48 at an MTU of 1,500; 952.13 at 2,048; 976.13 at 4,096; 987.89 at 8,192)


CPU Utilization and Checksum Offload

Checksum is used to ensure the integrity of data during transmission. The transmitter calculates the checksum of the data and transmits the data with its checksum. The receiver calculates the checksum using the same algorithm as the transmitter and compares the calculated checksum against the checksum received in the data stream. If the calculated and received checksums do not match, a transmission error has occurred.
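
For reference, the checksum in question is the standard one's-complement Internet checksum (RFC 1071) used by TCP, UDP, and IP. The generic C routine below shows the per-byte work that checksum offload moves into hardware; it is a textbook implementation, not the design's RTL or driver code.

#include <stdint.h>
#include <stddef.h>

/* One's-complement Internet checksum (RFC 1071) over 'len' bytes.
 * TCP and UDP run this over a pseudo-header plus the segment payload;
 * checksum offload moves exactly this per-byte work into the MAC. */
uint16_t internet_checksum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t sum = 0;

    while (len > 1) {               /* sum 16-bit words */
        sum += (uint32_t)p[0] << 8 | p[1];
        p += 2;
        len -= 2;
    }
    if (len == 1)                   /* pad a trailing odd byte with zero */
        sum += (uint32_t)p[0] << 8;

    while (sum >> 16)               /* fold carries back into the low 16 bits */
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;          /* one's complement of the sum */
}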

Figure 10 shows the results for TCP data payload checksum offload. (This design supports only partial checksum offload for TCP/UDP packets.) As seen in the figure, checksum offload improves CPU utilization for large packets.

Figure 10: Utilization with Checksum Offload (Outbound Traffic) (CPU utilization in % and throughput in Mb/s vs. message size in bytes, with and without checksum offload)

Operations like checksum calculation, which are expensive when done in software on large packets, should thus be offloaded to hardware so that CPU time can be efficiently utilized elsewhere. With small packets, checksum computation does not contribute as much to CPU utilization overhead as does the protocol overhead of using small packets.

Beyond a message size of 512 bytes, it can be seen that CPU utilization decreases appreciably. For the same throughput, the number of TCP/IP packets processed decreases with an increase in packet size. As a result, processing fewer packets consumes fewer CPU cycles.

On the Fedora 10 LiveCD Linux operating system (kernel version 2.6.27.5), it is observed that enabling checksum offload results in the TCP/IP stack providing Ethernet frames in fragments for transmission, i.e., one frame is split across multiple buffer descriptors. The results described here indicate that the performance improvement gained by checksum offload is offset by the fragmentation of frames. No significant improvement in throughput is seen with checksum offload.

Throughput Variation with Descriptor Count

Figure 11 shows throughput variation with descriptor count per descriptor ring. Each DMA engine has one descriptor ring. With a smaller number of descriptors in the ring, transmit performance is low, because only a small number of packets can be queued for transmission. Similarly, reduced descriptor count in the receive ring drains the incoming packets at a slower rate. It is also seen that with smaller descriptor counts, a larger number of retransmissions occur, which explains the reduced inbound throughput. With fewer descriptors, the software overhead required to manage descriptors is higher. Thus, the DMA drains packets at a slow rate from the Ethernet MAC. This causes the receive buffer in the Ethernet MAC to overflow, resulting in dropped packets. These dropped packets are responsible for the retransmissions.

Figure 11: Throughput Variation with Descriptor Count (outbound and inbound throughput in Mb/s vs. descriptor count per descriptor ring)

Beyond a descriptor count of 512, the performance remains stable because enough descriptors are available in both transmit and receive directions.
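
A generic head/tail descriptor ring, sketched below in C, illustrates why a small ring limits the number of packets that can be queued. The 32-byte descriptor layout, field names, and ring size are illustrative assumptions and do not describe the third-party packet DMA's actual descriptor format.

#include <stdint.h>
#include <stdbool.h>

#define RING_SIZE 512                 /* descriptor count per ring (see Figure 11) */

/* Illustrative 32-byte descriptor; the actual packet DMA defines its own layout. */
struct dma_descriptor {
    uint64_t buffer_addr;             /* physical address of the packet buffer */
    uint32_t length;                  /* bytes to transfer */
    uint32_t control;                 /* start/end of packet, interrupt-on-complete */
    uint8_t  reserved[16];            /* pad to 32 bytes, the size fetched over PCIe */
};

struct dma_ring {
    struct dma_descriptor desc[RING_SIZE];
    unsigned head;                    /* next descriptor software will post */
    unsigned tail;                    /* next descriptor hardware will complete */
};

/* Post one packet for transmission; fails when the ring is full, which is the
 * queueing limit that throttles throughput at small descriptor counts. */
bool ring_post(struct dma_ring *r, uint64_t addr, uint32_t len)
{
    unsigned next = (r->head + 1) % RING_SIZE;
    if (next == r->tail)
        return false;                 /* ring full: caller must wait or drop */
    r->desc[r->head].buffer_addr = addr;
    r->desc[r->head].length = len;
    r->desc[r->head].control = 0;     /* flags omitted in this sketch */
    r->head = next;
    return true;
}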

Interrupt Mode Performance

The software driver supports both interrupt and polled modes of operation. For interrupt mode operation, the processor is interrupted on a packet-count basis instead of on a per-packet basis. This count is known as the coalesce count.
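
The coalescing idea can be reduced to a per-completion counter, as in the hypothetical sketch below: an interrupt is raised once every COALESCE_COUNT completed packets rather than per packet. The names and structure are illustrative and do not reflect the actual driver implementation.

#include <stdint.h>

#define COALESCE_COUNT 32   /* packets completed per interrupt (see Figure 12) */

struct coalesce_state {
    uint32_t completed_since_irq;   /* packets finished since the last interrupt */
};

/* Called for every completed packet: returns nonzero when an interrupt should
 * be raised, i.e., once every COALESCE_COUNT packets. Larger counts mean fewer
 * interrupts (lower CPU load) but longer queueing delay before the host
 * processes completions. */
int packet_completed(struct coalesce_state *cs)
{
    if (++cs->completed_since_irq < COALESCE_COUNT)
        return 0;                   /* keep accumulating completions */
    cs->completed_since_irq = 0;
    return 1;                       /* raise a legacy interrupt now */
}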

Figure 12 shows the interrupt mode performance as seen with legacy interrupts, obtained by varying the coalesce count.

Figure 12 shows that performance deteriorates slightly with increasing coalesce count values. This is attributed to the fact that larger coalesce counts result in retransmissions due to network protocol timeouts because packets wait longer in queue to be processed.

Figure 12: Throughput and CPU Utilization in Interrupt Mode (Outbound Traffic) (outbound throughput in Mb/s and CPU utilization in % vs. coalesce count)



It can also be observed that CPU utilization is high for lower coalesce count values. Smaller coalesce count values mean more frequent interrupts.

Comparing polled-mode performance against interrupt-mode performance, it can be seen that CPU utilization for a coalesce count of 32 is 4.85% as compared to 6.02% measured in polled mode, but with a reduction in throughput (934.04 Mb/s versus 936.48 Mb/s in polled mode). On the PCIe link, the interrupts move upstream as TLPs and consume link bandwidth. Although the amount of bandwidth consumed is small, it is still sufficient to impact the throughput as credits are consumed. There are no interrupts in polled mode, but a timer in the software driver frequently polls the status of various descriptor rings. CPU utilization is thus higher in polled mode than in interrupt mode.

Summary

A summary of Ethernet performance is shown in Figure 13.

Figure 13: Ethernet Performance Summary (outbound and inbound throughput in Mb/s for the Xilinx SP605 NIC in polled mode, the Xilinx SP605 NIC in interrupt mode, and a commercial NIC)

The Spartan-6 FPGA Connectivity Kit’s NIC performance is measured against a commercial NIC so that true performance is reflected. The inbound performance clearly indicates that the Xilinx NIC can handle commercial NIC throughput. The outbound performance for the Xilinx NIC is lower when compared to the commercial NIC.

Conclusion

This white paper summarizes performance as measured on the Spartan-6 FPGA Connectivity targeted reference design available with the Spartan-6 FPGA Connectivity Kit. The results clearly demonstrate that:

• Throughput scales with variation in packet size.

• PCI Express performance improves with higher maximum payload size.

• CPU utilization improves with checksum offloading.

• An inverse relationship exists between packet size and CPU utilization.

The Spartan-6 FPGA Connectivity targeted reference design provides a high-quality, fully validated, and performance-profiled system that can help jump-start end-user development, thereby reducing time to market.


References

1. UG392, Spartan-6 FPGA Connectivity Targeted Reference Design User Guide

2. WP350, Understanding Performance of PCI Express Systems

3. Introduction to Gigabit Ethernet, Cisco Technology Digest
http://www.cisco.com/en/US/tech/tk389/tk214/tech_brief09186a0080091a8a.html

4. Gigabit Ethernet Performance in rp7410 Servers, Hewlett-Packard Performance Paper, September 2002
http://docs.hp.com/en/2176/GigE_rp7410_perf.pdf

5. Netperf documentation
http://www.netperf.org

Additional Information

1. Small Traffic Performance Optimization for 8255x and 8254x Ethernet Controllers, Intel Application Note AP-453, September 2003
http://download.intel.com/design/network/applnots/ap453.pdf

2. HP-UX IPFilter Performance, Hewlett-Packard White Paper
http://docs.hp.com/en/13758/ipfiltperf.pdf

3. 1000BASE-T Delivering Gigabit Intelligence, Cisco Technology Digest
http://www.cisco.com/en/US/tech/tk389/tk214/tech_digest09186a0080091a86.html

Revision History

The following table shows the revision history for this document:

Date        Version    Description of Revisions
04/19/10    1.0        Initial Xilinx release.

Notice of Disclaimer

The information disclosed to you hereunder (the “Information”) is provided “AS-IS” with no warranty of any kind, express or implied. Xilinx does not assume any liability arising from your use of the Information. You are responsible for obtaining any rights you may require for your use of this Information. Xilinx reserves the right to make changes, at any time, to the Information without notice and at its sole discretion. Xilinx assumes no obligation to correct any errors contained in the Information or to advise you of any corrections or updates. Xilinx expressly disclaims any liability in connection with technical support or assistance that may be provided to you in connection with the Information. XILINX MAKES NO OTHER WARRANTIES, WHETHER EXPRESS, IMPLIED, OR STATUTORY, REGARDING THE INFORMATION, INCLUDING ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NONINFRINGEMENT OF THIRD-PARTY RIGHTS.