Xilinx (UltraScale) vs. Altera (ARRIA 10) Test Bench
© By Roy Messinger, www.HWDebugger.com

1 GENERAL

In this document I present a thorough comparison I conducted between two FPGAs from two vendors' families: Altera ARRIA 10 and Xilinx Kintex UltraScale.

The comparison emphasizes frequency, utilization, power and compilation time. I carried it out in an attempt to find the 'best' vendor for my needs. I did not give any 'discounts' to either vendor: all the tests used exactly the same code and the same software preferences.

See the important notes at the end of this document for further info.

2 WHAT I CHECKED

• Frequency.
• Utilization.
• Thermal power.
• Compilation time.

3 FPGA COMPONENTS

I chose these FPGAs in order to compare two similar components in terms of RAM, size and various other characteristics.

Component                           System Logic [k]   RAM [Mb]   PCI-Gen 3   Transcv   I/O
Altera GX480 (10AX048K1F35E1HG)     629                28         2*8 lanes   36        396
Xilinx KU035 (XCKU035-1FFVA1156C)   444                25         2*8 lanes   16        520


4 TEST BENCH METHODOLOGY

How did I carry out the comparison?

• For the comparison I used a VHDL component: a state machine (FSM) of about 20 states. This FSM implements some heavy logic and runs at 400MHz.
• I designed 2 small projects containing only this component, one in Altera (Quartus) and one in Xilinx (Vivado).
• After each successful compilation, I checked the timing analysis and replicated the component to push the FPGA capabilities to the edge (space, frequency).
• I used virtual pins on all components, so there was no need to connect the component ports to the FPGA pins (no connection to IO buffers).
• I did not alter anything in either software tool. I left the implementation/synthesis settings at their default values.

[Figure: test flow diagram. Boxes: "Compile in Vivado & Quartus", decision "Passes timing req.?" (Yes / No), "Replicate component", "Compare to second vendor". Inset: the component inside the FPGA, connected through virtual pins.]
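The compile-and-replicate flow described above can be sketched as a simple sweep over replication counts. This is only an illustration: `compile_fmax` is a hypothetical stand-in for a full Quartus or Vivado compilation, and `fake_compile` is a made-up model used to make the sketch runnable, not measured data.

```python
from typing import Callable, List, Tuple


def replication_sweep(
    compile_fmax: Callable[[int], float],
    target_mhz: float,
    counts: List[int],
) -> List[Tuple[int, float, bool]]:
    """Compile once per replication count and record the outcome.

    compile_fmax is a hypothetical callback standing in for a vendor
    compilation; it returns the achieved max frequency in MHz.
    """
    results = []
    for n in counts:
        fmax = compile_fmax(n)
        # A run "passes" when the achieved frequency reaches the target.
        results.append((n, fmax, fmax >= target_mhz))
    return results


def fake_compile(n: int) -> float:
    # Made-up model for demonstration only: fmax drops 2 MHz per replica.
    return 430.0 - 2.0 * n


sweep = replication_sweep(fake_compile, 400.0, [4, 10, 15, 20])
for n, fmax, ok in sweep:
    print(n, fmax, "meets timing" if ok else "fails timing")
```

Running the same sweep once per vendor tool gives two result lists that can be compared row by row, which is the shape of the tables in the test results.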


5 TEST BENCH HARDWARE

• Compilation computers (both with Windows 7 OS):
o Altera:
▪ Quartus version 17.0.0.
▪ Xeon E5-2643 @ 3.4GHz, 32GB RAM.
o Xilinx:
▪ Vivado version 2016.4.
▪ i7-6700 @ 3.4GHz, 32GB RAM.

• The components chosen were close to the same spec (relative to what I need):
o Altera: 10AX048K1F35E1HG; GX480, highest speed grade.
o Xilinx: XCKU035-1FFVA1156C; KU035, slowest speed grade (see notes at the end of this document).
o Both components have the same package dimensions (35mm*35mm).


6 TEST RESULTS

I ran 3 sets of tests, defined as Test A, Test B and Test C.

• This is NOT a real design, but one that can compare the performance of both vendors, as it uses a real component and simulates the phases of HW FPGA development. The code is the same.
• Test A & Test B are, in my view, closer to a real-world implementation, as they define relations between different instantiations inside the FPGA.
• Test B is intended to push the FPGA to the edge in terms of frequency: neither vendor reaches this frequency, but both are supposed to make their best effort.
• I also implemented Test C to ease the vendors' Synthesis, Optimization and Place & Route phases, and to see what happens when there is no relation between different instantiations.
• The frequency comparison is between the WNS in Vivado (Worst Negative Slack, the worst of the worst) and the max frequency result in Quartus, which is based on the setup timing at 100°C in the timing report (also the worst of the worst).
• Both vendor tools use their default preferences (no 'best efforts', etc.).
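Vivado reports slack rather than a maximum frequency, so comparing it with the Quartus Fmax figure needs a conversion. The text does not say which conversion was used; a common one, assumed here, is Fmax = 1 / (target period - WNS). The WNS values in the example are illustrative, not taken from the reports.

```python
def fmax_from_wns(target_mhz: float, wns_ns: float) -> float:
    """Estimate the achievable frequency in MHz from a setup WNS in ns.

    Assumes the usual relation Fmax = 1 / (T_target - WNS), where a
    positive WNS means the path has margin and a negative WNS means
    the timing requirement failed.
    """
    period_ns = 1000.0 / target_mhz
    achievable_period_ns = period_ns - wns_ns
    if achievable_period_ns <= 0:
        raise ValueError("WNS is not smaller than the target period")
    return 1000.0 / achievable_period_ns


# Illustrative values only: a 400 MHz target has a 2.5 ns period.
print(fmax_from_wns(400.0, 0.1))   # margin left, so Fmax is above 400 MHz
print(fmax_from_wns(400.0, -0.1))  # timing failed, so Fmax is below 400 MHz
```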

[Figure: block diagrams of the three test topologies, each showing instantiations Inst. 1, Inst. 2, Inst. 3 ... Inst. 24.]

Test A, 400MHz: Each input is connected to all instantiations, as shown. Internal outputs, obviously, are separated.

Test B, 500MHz: Each input is connected to all instantiations, as shown. Outputs, obviously, are separated.

Test C, 400MHz: Each input is connected to each instantiation, as shown (no relation between different instantiations). Outputs, obviously, are separated.

2 clocks are created for the design in SDC (Quartus) & XDC (Vivado): 100MHz & 400MHz/500MHz.
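The two clock definitions mentioned above boil down to a period per clock. The sketch below converts the frequencies to periods and formats a `create_clock` line, a command that both SDC and XDC accept. The clock and port names (`clk_100`, `clk_fast`) are hypothetical; the actual constraint files are not shown in this document.

```python
def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / freq_mhz


def create_clock_line(name: str, port: str, freq_mhz: float) -> str:
    """Format one create_clock constraint (names are placeholders)."""
    return (
        f"create_clock -name {name} "
        f"-period {period_ns(freq_mhz):.3f} [get_ports {port}]"
    )


# Hypothetical clock/port names; 400.0 becomes 500.0 for Test B.
clocks = [("clk_100", "clk_100", 100.0), ("clk_fast", "clk_fast", 400.0)]
lines = [create_clock_line(name, port, mhz) for name, port, mhz in clocks]
for line in lines:
    print(line)
```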


Test A (at 400MHz):


These are the results for 400MHz:

General Notes & conclusions for Test A:

a. The same VHDL component was used with exactly the same parameters. The code is the same.
b. Compilation times of Vivado (Xilinx) were 20% faster than Quartus.
c. Frequency column values above 400MHz show the maximum frequency achieved, even though it was not required.
d. The Ultrascale (Xilinx) slope is much more stable and linear than that of ARRIA 10 (Altera), and stays steady above the 400MHz target frequency until it can no longer hold.

Continuing from note c, I then compared both projects at 500MHz (Test B): even though neither vendor can reach such a high frequency, each tool will make its best effort to reach the highest frequency it can.

Max. Frequency [MHz]

Desired freq.  Replicated   Altera     Xilinx
[MHz]          components   ARRIA 10   ULTRA-SCALE
400            4            430        423
400            5            433        413
400            7            417        409
400            8            395        411
400            9            433        414
400            10           403        414
400            11           419        411
400            12           383        411
400            13           401        411
400            14           389        410
400            15           420        409
400            16           409        409
400            17           402        410
400            18           370        412
400            19           316        417
400            20           383        420
400            25           362        411
400            30           364        416
400            35           315        410
400            37           315        411
400            40           315        387
400            45           330        392
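To make the stability claim concrete, the Test A rows above can be checked with a few lines of code. This is just one way to summarize the table: it finds the first replication count at which each vendor drops below the 400MHz target and counts how many rows meet it. The helper names are my own.

```python
TARGET_MHZ = 400

# (replicated components, Altera fmax, Xilinx fmax), copied from the table.
TEST_A = [
    (4, 430, 423), (5, 433, 413), (7, 417, 409), (8, 395, 411),
    (9, 433, 414), (10, 403, 414), (11, 419, 411), (12, 383, 411),
    (13, 401, 411), (14, 389, 410), (15, 420, 409), (16, 409, 409),
    (17, 402, 410), (18, 370, 412), (19, 316, 417), (20, 383, 420),
    (25, 362, 411), (30, 364, 416), (35, 315, 410), (37, 315, 411),
    (40, 315, 387), (45, 330, 392),
]
ALTERA, XILINX = 1, 2  # column indexes into each row


def first_failure(rows, column):
    """Smallest replication count whose fmax is below the target."""
    for row in rows:
        if row[column] < TARGET_MHZ:
            return row[0]
    return None


def passing_count(rows, column):
    """Number of rows whose fmax meets or exceeds the target."""
    return sum(1 for row in rows if row[column] >= TARGET_MHZ)


for name, column in (("Altera", ALTERA), ("Xilinx", XILINX)):
    print(name, first_failure(TEST_A, column),
          passing_count(TEST_A, column), "of", len(TEST_A))
```

On this data Altera first misses the target at 8 replications and Xilinx at 40.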


Test B (at 500MHz):


These are the results for 500MHz:

General Notes & conclusions for Test B:

a. Neither vendor could reach 500MHz. Nevertheless, Ultrascale was well ahead of ARRIA 10 in terms of frequency, space and compilation time.
b. Regarding logic element usage, there is a fixed ratio of 86% between Xilinx logic usage and Altera logic usage (Xilinx usage is lower than Altera). I used Xilinx formulas to compare CLB (LUT) counts to ALM counts.
c. The ARRIA 10 (Altera) vs. Ultrascale (Xilinx) logic usage ratio stays fixed throughout. This shows that neither vendor's replication algorithm changes: logic element usage rises linearly as replications increase, which is a good thing when comparing 'apples to apples'.

Desired  Replicated  Xilinx      Altera      Xilinx  Altera  Xilinx   Altera   Xilinx      Altera      Xilinx/Altera
freq.    components  achieved    achieved    util.   util.   util.    util.    normalized  normalized  usage
[MHz]                freq. [MHz] freq. [MHz] [%]     [%]     [LUT]    [ALM]    util.       util.       [%]
500      18          471         371         24.6    21      50,056   38,519   87,598      102,075     86
500      19          497         381         26      22.2    52,825   40,712   92,444      107,887     86
500      20          480         316         27.4    23.3    55,586   42,715   97,276      113,195     86
500      21          488         341         28.7    24.4    58,373   44,743   102,153     118,569     86
500      22          450         392         30.1    25.5    61,158   46,858   107,027     124,174     86
500      23          492         341         31.5    26.7    63,951   48,995   111,914     129,837     86
500      24          461         362         32.8    27.8    66,708   51,026   116,739     135,219     86
500      25          413         312         34.2    29      69,506   53,197   121,636     140,972     86
500      26          459         396         35.6    30.3    72,288   55,595   126,504     147,327     86
500      27          450         314         37      31.4    75,087   57,685   131,402     152,865     86
500      28          473         388         38.3    32.6    77,803   59,877   136,155     158,674     86
500      29          469         332         39.7    33.9    80,616   62,173   141,078     164,758     86
500      30          489         334         41.1    35.1    83,418   64,382   145,982     170,612     86
500      31          466         384         42.4    36.2    86,152   66,394   150,766     175,944     86
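The normalized columns can be reproduced from the raw LUT and ALM counts. The conversion factors are not stated in the text; the values 1.75 (per LUT) and 2.65 (per ALM) used below are back-computed from the table rows, so treat them as an assumption rather than the exact Xilinx formula.

```python
# Assumption: factors inferred from the table, not quoted from the author.
XILINX_LUT_FACTOR = 1.75
ALTERA_ALM_FACTOR = 2.65


def normalized(count: int, factor: float) -> float:
    """Scale a raw LUT or ALM count to the common normalized unit."""
    return count * factor


def usage_ratio_percent(luts: int, alms: int) -> float:
    """Xilinx normalized usage as a percentage of Altera normalized usage."""
    xilinx = normalized(luts, XILINX_LUT_FACTOR)
    altera = normalized(alms, ALTERA_ALM_FACTOR)
    return 100.0 * xilinx / altera


# Row for 18 replicated components: 50,056 LUTs vs. 38,519 ALMs.
print(round(normalized(50056, XILINX_LUT_FACTOR)))   # normalized Xilinx
print(round(normalized(38519, ALTERA_ALM_FACTOR)))   # normalized Altera
print(round(usage_ratio_percent(50056, 38519)))      # usage ratio in %
```

Because both raw counts grow linearly with the number of replications, the ratio stays the same from row to row.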


Test C (at 400MHz):


Test C: achieved frequency and compilation time (mm:ss or hh:mm:ss)

Desired freq.  Replicated   Xilinx        Altera        Xilinx             Altera
[MHz]          components   achieved      achieved      compilation time   compilation time
                            freq. [MHz]   freq. [MHz]
400            8            410           420           08:42              15:27
400            9            411           424           09:48              18:30
400            10           412           419           10:46              20:00
400            11           409           409           11:15              21:37
400            12           410           417           12:58              20:24
400            13           414           406           13:00              25:01
400            14           409           418           13:25              28:00
400            15           410           420           13:32              28:01
400            16           418           401           14:24              31:24
400            17           408           394           14:06              32:09
400            18           419           411           15:47              33:00
400            19           410           423           15:39              36:02
400            20           411           408           16:52              37:00
400            21           420           405           28:00              40:00
400            22           409           416           30:00              38:22
400            23           408           412           32:00              39:30
400            24           418           398           32:20              41:24
400            25           420           371           33:00              43:55
400            26           411           411           36:00              45:48
400            27           409           410           36:00              45:40
400            28           410           409           40:00              50:40
400            29           411           415           41:10              52:21
400            30           409           407           26:00              54:00
400            31           416           406           42:00              56:29
400            32           408           407           42:00              57:44
400            33           414           402           48:14              58:23
400            34           412           404           46:30              58:44
400            35           409           404           50:00              01:01:52
400            36           401           380           47:37              01:05:00
400            37           401           393           52:21              59:39
400            38           408           417           50:00              01:07:02
400            39           407           334           57:30              01:10:00
400            40           409           395           53:03              01:02:00
400            41           409           408           55:00              01:11:00
400            42           404           359           56:55              01:01:05
400            43           402           395           58:52              01:13:00
400            44           390           393           01:03:00           01:12:00
400            45           410           406           01:04:00           01:19:00
400            46           404           394           01:05:01           01:22:00
400            47           378           397           01:09:00           01:23:00
400            48           409           371           01:06:00           01:29:00

Test C: utilization and power dissipation ('-' marks an empty cell; replication counts with no utilization or power values are omitted)

Replicated   Xilinx   Altera   Xilinx    Altera    Xilinx       Altera       Xilinx/Altera   Power      Power
components   util.    util.    util.     util.     normalized   normalized   utilization     Xilinx     Altera
             [%]      [%]      [LUT]     [ALM]     util.        util.        ratio [%]       [W]        [W]
21           29       32       -         -         -            -            -               1.66       3.27
22           30       34       -         -         -            -            -               1.7        3.38
23           31       36       -         -         -            -            -               1.78       3.48
24           33       37       -         -         -            -            -               1.83       3.6
25           34       39       -         -         -            -            -               1.89       -
26           36       40       -         -         -            -            -               1.95       3.75
27           37       42       -         -         -            -            -               2          4
28           38       43       -         -         -            -            -               2          4
29           40       45       -         -         -            -            -               -          -
30           41       46       83,448    85,093    146,034      225,496      65              2.17       4.172
31           42       48       -         -         -            -            -               -          -
32           44       49       -         -         -            -            -               -          5.3
33           45       51       91,761    93,598    160,582      248,035      65              2.34       4.46
34           47       53       -         -         -            -            -               -          -
35           48       54       -         -         -            -            -               -          -
39           53       60       108,271   110,627   189,474      293,162      65              2.577      4.9
41           56       63       113,857   116,295   199,250      308,182      65              2.685      -
43           59       66       -         -         -            -            -               -          5.25
44           60       68       122,357   124,801   214,125      330,723      65              2.846      -
45           62       70       -         -         -            -            -               2.9        -
46           63       71       -         -         -            -            -               2.95       5.457
47           64       73       -         -         -            -            -               3.008      5.5
48           66       -        -         -         -            -            -               3.06       -

Although the power dissipation figures are not 'real' because virtual pins are used, the comparison between vendors is still 'legal', since the same conditions apply to both.
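The compilation-time and power columns can be compared per row with a small helper. The sketch below parses the mm:ss and hh:mm:ss strings used in the Test C table and computes how much lower the Xilinx figure is relative to Altera for one row (21 replicated components). The function names are my own.

```python
def to_seconds(text: str) -> int:
    """Parse 'mm:ss' or 'hh:mm:ss' into a number of seconds."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def percent_less(a: float, b: float) -> float:
    """How much smaller a is than b, as a percentage of b."""
    return 100.0 * (1.0 - a / b)


# Row for 21 replicated components in the Test C table.
xilinx_time = to_seconds("28:00")
altera_time = to_seconds("40:00")
xilinx_power, altera_power = 1.66, 3.27

time_saving = round(percent_less(xilinx_time, altera_time))
power_saving = round(percent_less(xilinx_power, altera_power))
print("compilation time saving [%]:", time_saving)
print("power saving [%]:", power_saving)
```

Applying the same helpers to every row that has both values gives the per-row savings behind the conclusions for this test.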


General Notes & conclusions for Test C:

a. In this test, though less realistic in my view, both vendors can hold more replications before they fail the timing requirements. Nevertheless, ARRIA 10 (Altera) keeps failing at much earlier points than Ultrascale (Xilinx).
b. Xilinx compilation times are about 20% faster than Altera's.
c. Regarding logic element usage, there is a fixed ratio of 65% between Xilinx logic usage and Altera logic usage (Xilinx usage is lower than Altera). I used Xilinx formulas to compare LUT counts to ALM counts.
d. In this test I also compared thermal power: Ultrascale consumes about 50% less power than ARRIA 10 (meaning less overall heat and less power supply current needed).


7 TEST RESULTS SUMMARY

So, overall:

A. When comparing Altera ARRIA 10 GX480, F35, to Xilinx UltraScale KU035, A1156, Xilinx was better in:

• Compilation time (Xilinx 20% less).
• Frequency (Xilinx was much more stable and reached higher frequencies).
• Thermal power (Xilinx almost 50% less power).
• Utilization (Xilinx to Altera ratio of 86% in Test B, 65% in Test C).

B. Even when I compared Altera's GX320 to Xilinx's KU035 (a smaller Altera component against the 'same' Xilinx component), the KU035 had better results in all these characteristics. For example, when compiling for Altera's GX320, F35 (same package as Altera's GX480), which should be 'equal' to Xilinx's KU035, with 44 replications:

Quartus utilization for GX320 for 44 replications, Test C:
Logic utilization (in ALMs) 139,107 / 119,900 ( 116 % )
The compilation failed: not enough room in the device.

Xilinx utilization for KU035 for 44 replications, Test C:
60%.

C. When I compared ARRIA 10 GX270 to Xilinx's KU035, I had similar results in all characteristics (I did not check all replications).

Notes:

2 very important points I discovered after conducting this comparison (both should tip the scale in favor of Intel/Altera, and nevertheless Xilinx results are much better):

• The Xilinx FPGA chosen was smaller than the Altera one. This means the Xilinx P&R algorithm must work harder to reach the desired frequency (since less space is available). Nevertheless, Xilinx results are much better.
• The Xilinx FPGA has the slowest speed grade, while the Altera FPGA has the fastest. This means Altera results should be better. Nevertheless, they are much worse.