
The Case for Embedded NoCs on FPGAs

Mohamed ABDELFATTAH, Vaughn BETZ

Outline

1. Why NoCs on FPGAs?
2. Embedded NoCs
3. Area & Power Analysis
4. Comparison Against P2P/Buses

Motivation (1. Why NoCs on FPGAs?)

Figure labels: Logic Blocks, Interconnect (Switch Blocks, Wires)

Hard Blocks:
• Memory
• Multiplier
• Processor

Hard Interfaces: DDR/PCIe ..

Interconnect still the same

Clock labels in figure: 1600 MHz, 200 MHz, 800 MHz

Motivation (1. Why NoCs on FPGAs?)

Figure labels: DDR3 PHY and Controller, PCIe Controller, Gigabit Ethernet; 1600 MHz, 200 MHz, 800 MHz

Problems:
1. Bandwidth requirements for hard logic/interfaces
2. Timing closure
3. High interconnect utilization:
– Huge CAD problem
– Slow compilation
– Power/area utilization
4. Wire speed not scaling:
– Delay is interconnect-dominated

Barcelona vs. Los Angeles (source: Google Earth)

Keep the “roads”, but add “freeways”.

Figure labels: Hard Blocks, Logic Cluster

FPGA with NoC (1. Why NoCs on FPGAs?)

Figure labels: NoC, Routers, Links

• Router forwards data packet
• Router moves data to local interconnect

FPGA with NoC (1. Why NoCs on FPGAs?)

Problems (continued):
5. Abstraction favours modularity:
– Parallel compilation
– Partial reconfiguration
– Multi-chip interconnect

• High bandwidth endpoints known
• Pre-design NoC to requirements
• NoC links are “re-usable”
• NoC is heavily “pipelined”
• NoC abstraction favors modularity

FPGA with NoC (1. Why NoCs on FPGAs?)

• Latency-tolerant communication
• NoC abstraction favors modularity

NoCs can simplify FPGA design

• Does the NoC abstraction come at a high area/power cost?
• How to integrate NoCs in FPGAs?
• How do embedded NoCs compare to current interconnects?

Outline

1. Why NoCs on FPGAs?
2. Embedded NoCs
• Mixed NoCs
• Hard NoCs
3. Area & Power Analysis
4. Comparison Against P2P/Buses

Embedded NoCs (2. Embedded NoCs)

Figure labels (FPGA): DDRx Interface, PCIe Interface, Router, Compute Module, Links (Hard or Soft), Fabric Port (Hard or Soft)

“Soft” NoC = Soft Links + Soft Routers
“Mixed” NoC = Soft Links + Hard Routers
“Hard” NoC = Hard Links + Hard Routers

Methodology

Soft: FPGA CAD Tools
Hard: ASIC CAD Tools (Design Compiler)
Mixed: HSPICE

Metrics: Area, Speed, Power
Power: toggle rates from gate-level simulation (Soft and Hard)

Mixed NoCs (2. Embedded NoCs)

“Mixed” NoC = Soft Links + Hard Routers

Figure labels: FPGA, Logic blocks, Router (Router Logic), Programmable “soft” interconnect

Baseline Router:

Width   VCs   Ports   Buffer
32      2     5       10/VC
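The baseline router parameters can be captured in a small record to make the storage they imply explicit. This is only a sketch: the names `RouterConfig` and `input_buffer_bits` are hypothetical, and it assumes "10/VC" means ten buffer slots per virtual channel, each one link-width wide, with one buffer per VC on every port.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouterConfig:
    """Baseline router parameters as listed on the slide."""
    width_bits: int    # link / flit width
    num_vcs: int       # virtual channels per port
    num_ports: int     # router ports
    buffer_depth: int  # buffer slots per VC (assumed: one slot = one flit)

    def input_buffer_bits(self) -> int:
        # Assumption: one buffer per VC per port, each slot width_bits wide.
        return self.num_ports * self.num_vcs * self.buffer_depth * self.width_bits


BASELINE = RouterConfig(width_bits=32, num_vcs=2, num_ports=5, buffer_depth=10)
print(BASELINE.input_buffer_bits())  # 5 * 2 * 10 * 32 = 3200 bits
```

Under these assumptions the baseline router holds 3200 bits of buffering; a different reading of "10/VC" would change that figure.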


Mixed NoCs (2. Embedded NoCs)

Special Feature: Configurable topology
• Assumed a mesh
• Can form any topology

Figure labels: FPGA, Router (Router Logic), Programmable Interconnect

Hard NoCs (2. Embedded NoCs)

“Hard” NoC = Hard Links + Hard Routers

Figure labels: FPGA, Logic blocks, Router (Router Logic), Dedicated “hard” interconnect, Programmable “soft” interconnect


Hard NoCs (2. Embedded NoCs)

Special Feature: Low-V mode
• 1.1 V → 0.9 V
• Save 33% Dynamic Power
• ~15% slower

Figure labels: FPGA, Router (Router Logic), Dedicated Interconnect
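The 33% figure for Low-V mode is consistent with the usual first-order model in which dynamic power scales with the square of the supply voltage. The sketch below checks that arithmetic. It assumes switched capacitance, activity, and clock frequency are held constant, which is a simplification rather than the authors' stated method, and `dynamic_power_saving` is a hypothetical helper name.

```python
def dynamic_power_saving(v_nominal: float, v_low: float) -> float:
    """Fractional dynamic-power saving when lowering the supply voltage.

    Assumes P_dyn is proportional to V^2 with capacitance, activity and
    frequency unchanged (first-order CMOS model).
    """
    return 1.0 - (v_low / v_nominal) ** 2


saving = dynamic_power_saving(v_nominal=1.1, v_low=0.9)
print(f"{saving:.0%}")  # about 33%
```

If the clock were also slowed by the ~15% quoted on the slide, this model would predict a larger saving, so the 33% presumably refers to operation at the same frequency.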

Fabric Port (2. Embedded NoCs)

Figure labels: Router, Compute Module, Links (Hard or Soft), Fabric Port (Hard or Soft)

Bridge NoC and FPGA fabric:
• Width adaptation
• Frequency adaptation
• Voltage adaptation
• Bus protocol e.g. AXI

Outline

1. Why NoCs on FPGAs?
2. Embedded NoCs
3. Area & Power Analysis
• Soft vs. Mixed vs. Hard
• System Area/Power
4. Comparison Against P2P/Buses

Router Microarchitecture (3. Area/Power Analysis: Routers and Links)

State-of-the-art router architecture from Stanford:
1. The NoC community has excelled at building on-chip routers: we just use it
2. To meet FPGA bandwidth requirements: high-performance router
3. Complex functionality such as virtual channels: assigning traffic priority could be useful

Figure labels: Input Modules (1 to 5), Output Modules (1 to 5), Virtual Channel (VC) Allocator, Switch Allocator, Crossbar Switch

Hard vs. Soft (3. Area/Power Analysis)

Hard Router vs. Soft Router: 30X smaller, 6X faster, 14X lower power
Hard Links vs. Soft Links: 9X smaller, 2.4X faster, 1.4X lower power

Soft, Mixed and Hard (3. Area/Power Analysis)

64-node NoC on Stratix III [65 nm]

               Soft           Mixed       Hard
Area           ~12,500 LBs    576 LBs     448 LBs
               33% of FPGA    ~1.5% of FPGA
Speed          166 MHz        730 – 940 MHz
Bisection BW   ~10 GB/s       ~50 GB/s

• Provides ~50 GB/s peak bisection bandwidth
• Very cheap! Less than cost of 3 soft nodes
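The bandwidth and area figures on this slide can be sanity-checked with a few lines of arithmetic. The sketch assumes the 64 nodes form an 8 x 8 mesh with 32-bit links (the baseline router width), and that the bisection cut crosses 8 links in each direction; these assumptions and the function name are illustrative and not taken from the slides.

```python
def mesh_bisection_bandwidth_gbs(side: int, width_bits: int, freq_mhz: float) -> float:
    """Peak bisection bandwidth (GB/s) of a side x side mesh.

    Assumes the cut crosses `side` links in each direction, i.e.
    2 * side unidirectional links, each moving width_bits per cycle.
    """
    links_cut = 2 * side
    bytes_per_cycle = width_bits / 8
    return links_cut * bytes_per_cycle * freq_mhz * 1e6 / 1e9


soft_bw = mesh_bisection_bandwidth_gbs(8, 32, 166)     # slide quotes ~10 GB/s
hard_bw_lo = mesh_bisection_bandwidth_gbs(8, 32, 730)  # slide quotes ~50 GB/s
hard_bw_hi = mesh_bisection_bandwidth_gbs(8, 32, 940)

# Area: a 64-node soft NoC takes ~12,500 LBs, so one soft node is ~195 LBs.
soft_lbs_per_node = 12_500 / 64
three_soft_nodes = 3 * soft_lbs_per_node

print(f"soft {soft_bw:.1f} GB/s, hard {hard_bw_lo:.1f} to {hard_bw_hi:.1f} GB/s")
print(f"3 soft nodes = {three_soft_nodes:.0f} LBs vs. 576 (mixed) and 448 (hard)")
```

With these assumptions the soft NoC lands near 10 GB/s and the embedded NoCs land between roughly 47 and 60 GB/s, and both the mixed and hard 64-node NoCs come out smaller than three soft nodes.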

NoC Power Budget (3. Area/Power Analysis)

Typical FPGA Dynamic Power: 17.4 W (largest Stratix-III device)
How much is used for system-level communication?

NoC power at 250 GB/s total bandwidth, as a share of typical FPGA dynamic power:

Soft NoC            123%
Mixed NoC           15%
Hard NoC            11%
Hard NoC (Low-V)    7%
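To make the percentages concrete, they can be converted to watts and to an approximate energy per byte moved. This is a sketch under the assumption that each percentage is NoC power divided by the 17.4 W typical dynamic power while carrying 250 GB/s; the dictionary and function names are hypothetical.

```python
FPGA_DYNAMIC_POWER_W = 17.4   # typical dynamic power, largest Stratix-III device
TOTAL_BANDWIDTH_GBS = 250.0   # total bandwidth carried by the NoC

# Share of typical FPGA dynamic power, as read off the slide.
POWER_SHARE = {
    "Soft NoC": 1.23,
    "Mixed NoC": 0.15,
    "Hard NoC": 0.11,
    "Hard NoC (Low-V)": 0.07,
}


def noc_power_w(share: float) -> float:
    """NoC power in watts for a given share of the FPGA power budget."""
    return share * FPGA_DYNAMIC_POWER_W


def energy_per_byte_pj(share: float) -> float:
    """Energy per byte in picojoules, assuming all power moves data."""
    return noc_power_w(share) / (TOTAL_BANDWIDTH_GBS * 1e9) * 1e12


for name, share in POWER_SHARE.items():
    print(f"{name}: {noc_power_w(share):.2f} W, {energy_per_byte_pj(share):.1f} pJ/B")
```

On this reading the soft NoC alone would exceed the whole typical power budget (about 21 W), while the hard Low-V NoC needs a little over 1 W for the same traffic.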

Bandwidth in Perspective (3. Area/Power Analysis)

Figure labels: DDR3, PCIe, Module 1, Module 2; 4 × 14.6 GB/s, 4 × 17 GB/s

• Full theoretical BW
• Cross whole chip!

Aggregate Bandwidth: 126 GB/s
NoC Power Budget: 3.5%
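One way to read these numbers is sketched below. It assumes the figure shows four 14.6 GB/s streams and four 17 GB/s streams, and that NoC power scales linearly with bandwidth from the Hard NoC (Low-V) figure of 7% at 250 GB/s. Both are assumptions; they are used here only because they reproduce the quoted 126 GB/s and 3.5%.

```python
# Assumed reading of the figure: four streams at each labelled rate (GB/s).
stream_rates_gbs = [14.6] * 4 + [17.0] * 4
aggregate_gbs = sum(stream_rates_gbs)

# Assumed linear scaling from the Hard NoC (Low-V) point: 7% at 250 GB/s.
REFERENCE_SHARE = 0.07
REFERENCE_BANDWIDTH_GBS = 250.0
power_share = REFERENCE_SHARE * aggregate_gbs / REFERENCE_BANDWIDTH_GBS

print(f"aggregate = {aggregate_gbs:.1f} GB/s, power budget = {power_share:.1%}")
```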

Outline

1. Why NoCs on FPGAs?
2. Embedded NoCs
3. Area & Power Analysis
4. Comparison Against P2P/Buses
• Point-to-point links
• Qsys Buses

FPGA Interconnect (4. Comparison)

• Point-to-point Links (1 to 1)
• Broadcast (1 to n)
• Multiple Masters (n to 1): Mux + Arbiter
• Multiple Masters, Multiple Slaves (n to n): Mux + Arbiter

Interconnect = Just wires
Interconnect = Wires + Logic
Interconnect = NoC

Compare “wires” interconnect to NoCs

NoC Power vs. FPGA Interconnect (4. Comparison)

Hard and Mixed NoCs: Area/Power Efficient
• High Performance / Packet Switched
• Runs at 730-943 MHz
• 1% area overhead on Stratix 5
• Power on-par with simplest FPGA interconnect

Figure labels: Length of 1 NoC Link, 200 MHz

DDR3: Qsys Bus vs. NoC (4. Comparison)

• Qsys bus: Build logical bus from fabric
• Embedded NoC: 16 Nodes, hard routers & links

Design Effort (4. Comparison)

• Steps to close timing using Qsys

Figure labels: FPGA, close, far

Timing closure can be simplified with an embedded NoC

Area Comparison (4. Comparison)

• Entire NoC smaller than bus for 3 modules!
• Even with only 1/8 of the hard NoC bandwidth used, the NoC already takes less area for most systems

Power Comparison (4. Comparison)

Hard NoC saves power for even the simplest systems

1. Why NoCs on FPGAs?
• Big city needs freeways to handle traffic

2. Embedded NoCs: Mixed & Hard
• Area: 20-23X
• Speed: 5-6X
• Power: 9-15X

3. Area & Power Analysis
• Area Budget for 64 nodes: ~1%
• Power Budget for 100 GB/s: 3-7%

4. Comparison Against P2P/Buses
• Raw efficiency close to simplest P2P links
• NoC more efficient & lower design effort

eecg.utoronto.ca/~mohamed/noc_designer.html


Thank You!


Fabric Port (2. Embedded NoCs)

200 MHz 128-bit module, 900 MHz 32-bit router?
• Configurable time-domain mux / demux: match bandwidth
• Asynchronous FIFO: cross clock domains
• Full NoC bandwidth, w/o clock restrictions on modules
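The width and frequency adaptation described here can be illustrated with a small behavioural model. It is a sketch only: the function names are hypothetical, the least-significant-first flit order is an arbitrary choice, and the asynchronous FIFO that crosses clock domains is not modelled.

```python
from math import ceil
from typing import List


def tdm_ratio(module_width_bits: int, router_width_bits: int) -> int:
    """Number of router-width flits needed per module word."""
    return ceil(module_width_bits / router_width_bits)


def min_router_freq_mhz(module_freq_mhz: float, module_width_bits: int,
                        router_width_bits: int) -> float:
    """Lowest router clock that matches the module's bandwidth."""
    return module_freq_mhz * module_width_bits / router_width_bits


def mux(word: int, module_width_bits: int, router_width_bits: int) -> List[int]:
    """Time-domain mux: split one wide word into narrow flits (LSB first)."""
    mask = (1 << router_width_bits) - 1
    n = tdm_ratio(module_width_bits, router_width_bits)
    return [(word >> (i * router_width_bits)) & mask for i in range(n)]


def demux(flits: List[int], router_width_bits: int) -> int:
    """Time-domain demux: reassemble the wide word from its flits."""
    word = 0
    for i, flit in enumerate(flits):
        word |= flit << (i * router_width_bits)
    return word


word = 0x0123456789ABCDEF_FEDCBA9876543210  # one 128-bit module word
flits = mux(word, 128, 32)
print(tdm_ratio(128, 32))                  # 4 flits per word
print(min_router_freq_mhz(200, 128, 32))   # 800.0 MHz needed, 900 MHz available
```

For the slide's example, a 4:1 mux needs the router to run at 800 MHz or faster, so a 900 MHz 32-bit router can carry the full bandwidth of a 200 MHz 128-bit module.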

Compute Acceleration (1. Why NoCs on FPGAs?)

• Maxeler
– Geoscience (14x, 70x)
– Financial analysis (5x, 163x)
• Altera OpenCL
– Video compression (3x, 114x)
– Information filtering (5.5x)

Figure labels: GPU, CPU, NoC