
Page 1: FGPU: An SIMT-Architecture for FPGAs

RUHR-UNIVERSITÄT BOCHUM

FGPU: An SIMT-Architecture for FPGAs

Muhammed Al Kadi, Benedikt Janssen, Michael Huebner

Ruhr University of Bochum

Chair for Embedded Systems for Information Technology

FPGA’16, Monterey, 23 February 2016

www.esit.rub.de

Page 2: FGPU: An SIMT-Architecture for FPGAs


Outline

1 Motivation and Background

2 FGPU Architecture: Execution Model, Platform Model, Compute Unit Architecture, Global Memory Controller

3 Implementation and Results

4 Future Work and Conclusion

FGPU | FPGA’16, Monterey, 23 February 2016 2


Page 4: FGPU: An SIMT-Architecture for FPGAs


Motivation: What is FGPU?

An FPGA-GPU for general-purpose computing

A portable, flexible and scalable soft processor

A multi-core GPU-like architecture

Its ISA is designed to support the execution of OpenCL kernels

Capable of serving multiple AXI4-compatible data interfaces through an internal L1 cache

It does not replicate any other architecture

Implemented completely in VHDL-2002

[Figure: FGPU top-level block diagram — Thread Scheduler, AXI Control Interface, PE0–PE7, CU Memory Controller, Cache Memory, AXI Data Interfaces, Global Memory Controller, Compute Devices, Global Memory]

FGPU | FPGA’16, Monterey, 23 February 2016 3

Page 5: FGPU: An SIMT-Architecture for FPGAs


Motivation: Why FGPU?

Standard programming: OpenCL-compatible, no PRAGMAs necessary, shorter development cycles

Scalable task management: new tasks occupy no extra area on the FPGA

Design space exploration: not possible with hard embedded GPUs

Application-specific adaptations: to achieve the best area/power and performance trade-off

High efficiency: compared to other soft architectures


FGPU | FPGA’16, Monterey, 23 February 2016 4


Page 11: FGPU: An SIMT-Architecture for FPGAs


Background: SIMT and GPGPU

SIMT (Single-Instruction, Multiple-Threads)

is a parallel execution model for programming many-core architectures.

A single instruction can be concurrently executed by multiple threads.

Threads are scheduled at runtime on the available cores.

GPGPU (General-Purpose Computing on Graphics Processing Units)

An efficient solution for many applications, e.g. filtering, scientific simulations, matrix operations, sorting, DSP, etc.

Embedded GPUs are easier to program with OpenCL or CUDA than FPGAs with HDLs.

FGPU | FPGA’16, Monterey, 23 February 2016 5
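A minimal Python model can make the lockstep idea concrete: one instruction is fetched once and then executed by every thread of a wavefront, all sharing a single program counter. The instruction encoding and register names below are invented for illustration; this is not FGPU code.

```python
# Minimal SIMT sketch: one fetched instruction is executed by all
# threads of a wavefront in lockstep (illustrative model, not FGPU code).

def run_wavefront(program, registers):
    """Execute `program` for every thread; all threads share one PC."""
    pc = 0
    while pc < len(program):
        op, dst, src_a, src_b = program[pc]  # single instruction fetch ...
        for regs in registers:               # ... executed by every thread
            if op == "add":
                regs[dst] = regs[src_a] + regs[src_b]
            elif op == "mul":
                regs[dst] = regs[src_a] * regs[src_b]
        pc += 1                              # shared program counter
    return registers

# Four threads, each with its own register file r0..r2.
threads = [{"r0": i, "r1": 10, "r2": 0} for i in range(4)]
program = [("add", "r2", "r0", "r1"),   # r2 = r0 + r1
           ("mul", "r2", "r2", "r2")]   # r2 = r2 * r2
result = run_wavefront(program, threads)
print([t["r2"] for t in result])  # [100, 121, 144, 169]
```

Each thread produces a different result from the same instruction stream because only the register contents differ per thread.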



Page 16: FGPU: An SIMT-Architecture for FPGAs


FGPU Architecture: Execution Model (OpenCL-Compatible)

Threads (work-items) are coordinated in a 1-, 2- or 3-dimensional index space

Work-items get scheduled together in Work-groups (WGs) on idle cores

WGs are split into Wavefronts (WFs)

A WF’s work-items share the same program counter

Example (Array multiplication)

[Figure: a 1D index space of N work-items; work-item i multiplies element i of the 1st array with element i of the 2nd array into the result array; consecutive work-items form wavefronts, which in turn form work-groups]

FGPU | FPGA’16, Monterey, 23 February 2016 6
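The array-multiplication example above can be sketched in Python: a 1D index space is decomposed into work-groups and wavefronts, and each work-item computes one output element. The work-group and wavefront sizes are assumptions chosen for illustration, not values taken from the slides.

```python
# Sketch of the OpenCL-style decomposition used by FGPU:
# a 1D index space is split into work-groups, which are split into
# wavefronts; every work-item i computes result[i] = a[i] * b[i].
# Sizes below are illustrative, not taken from the slides.

WG_SIZE = 16   # work-items per work-group (assumed)
WF_SIZE = 4    # work-items per wavefront  (assumed)

def decompose(n):
    """Yield (work_group_id, wavefront_id, global_ids) triples."""
    for wg in range(0, n, WG_SIZE):
        group = list(range(wg, min(wg + WG_SIZE, n)))
        for off in range(0, len(group), WF_SIZE):
            yield wg // WG_SIZE, off // WF_SIZE, group[off:off + WF_SIZE]

def array_mult(a, b):
    result = [0] * len(a)
    for _wg, _wf, ids in decompose(len(a)):
        for i in ids:          # work-items of one wavefront share a PC
            result[i] = a[i] * b[i]
    return result

print(array_mult([1, 2, 3], [4, 5, 6]))  # [4, 10, 18]
```

For n = 20 this decomposition yields one full work-group of four wavefronts plus a partial work-group with one wavefront.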



Page 22: FGPU: An SIMT-Architecture for FPGAs


FGPU Architecture: Platform Model

[Figure: platform block diagram — WG Dispatcher, AXI Control Interface, CRAM, LRAM, Control Registers, WF Scheduler, CU, RTM, PE0–PE7, CU Memory Controller, Cache Memory, AXI Data Interfaces, Global Memory Controller]

FGPU accommodates several Compute Units (CUs), each holding a single array of Processing Elements (PEs)

All PEs in a CU execute the same instruction at the same time

The binary code is executed from the Code RAM (CRAM)

Information that does not belong to the binary, e.g. the number of work-items to launch or parameter values, has to be stored in the Link RAM (LRAM)

FGPU | FPGA’16, Monterey, 23 February 2016 7
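To make the CRAM/LRAM split concrete, here is a hedged sketch of what a host-side launch could look like: the kernel binary goes to the CRAM, while everything the binary does not encode (work-item count, parameter values) goes to the LRAM before the kernel is started via the control registers. The `FGPUControl` class, all slot layouts and the start-register mechanism are hypothetical; the slides do not specify them.

```python
# Hypothetical host-side launch sequence over the AXI control interface:
# binary -> CRAM, launch metadata (work-item count, parameters) -> LRAM,
# then start via a control register. Layouts are invented for illustration.

class FGPUControl:
    def __init__(self):
        self.cram = {}   # code RAM: word address -> instruction word
        self.lram = {}   # link RAM: word address -> metadata word
        self.start = 0   # stand-in for a "start" control register

    def load_binary(self, words):
        for addr, w in enumerate(words):
            self.cram[addr] = w

    def set_kernel_info(self, num_work_items, params):
        self.lram[0] = num_work_items          # assumed slot 0
        for i, p in enumerate(params, start=1):
            self.lram[i] = p                   # assumed slots 1..n

    def launch(self):
        self.start = 1

ctrl = FGPUControl()
ctrl.load_binary([0x1234, 0x5678])             # dummy instruction words
ctrl.set_kernel_info(1024, [0x10000000, 0x20000000, 0x30000000])
ctrl.launch()
```

The point of the sketch is only the separation of concerns: the CRAM holds code, the LRAM holds per-launch metadata, and the control registers trigger execution.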


Page 27: FGPU: An SIMT-Architecture for FPGAs


FGPU Architecture: Platform Model (continued)

[Figure: platform block diagram — WG Dispatcher, AXI Control Interface, CRAM, LRAM, Control Registers, WF Scheduler, CU, RTM, PE0–PE7, CU Memory Controller, Cache Memory, AXI Data Interfaces, Global Memory Controller]

The WG Dispatcher assigns WGs to idle CUs

The WF Scheduler admits WFs for execution, e.g. after a memory access is performed

The Runtime Memory (RTM) implements the Work-Item Built-In Functions defined by OpenCL, e.g. by feeding coordinates in the index space.

FGPU | FPGA’16, Monterey, 23 February 2016 8
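The dispatching scheme above can be sketched as a simple model: pending WGs are handed to idle CUs, and a CU returns to the idle pool when its WG finishes. The completion policy (exactly one CU finishes per step) is an arbitrary assumption that keeps the sketch deterministic; it is not from the slides.

```python
# Sketch of the WG Dispatcher's job: assign pending work-groups to idle
# compute units as they become free (completion policy is illustrative).

from collections import deque

def dispatch(num_wgs, num_cus, finished_per_step=1):
    pending = deque(range(num_wgs))
    idle = deque(range(num_cus))
    busy = {}                      # cu -> wg currently executing
    schedule = []                  # (wg, cu) in assignment order
    while pending or busy:
        while pending and idle:    # hand WGs to idle CUs
            cu = idle.popleft()
            busy[cu] = pending.popleft()
            schedule.append((busy[cu], cu))
        # pretend some CUs finish and become idle again
        for cu in list(busy)[:finished_per_step]:
            del busy[cu]
            idle.append(cu)
    return schedule

sched = dispatch(num_wgs=5, num_cus=2)
print(sched)  # [(0, 0), (1, 1), (2, 0), (3, 1), (4, 0)]
```

Every WG is dispatched exactly once, and no CU ever holds more than one WG at a time, which is the invariant the dispatcher maintains.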



Page 33: FGPU: An SIMT-Architecture for FPGAs


FGPU Architecture: Compute Unit Architecture

8 WFs can be managed within a single CU

The same instruction gets executed over 8 clock cycles on 8 PEs

A work-item owns 32 registers (32 bit)

Register files are held in dual-port RAMs and switched with no latency

The pipeline consists of 18 stages!

The register files and the ALUs run at a doubled clock frequency

64 outstanding memory requests can be managed at the same time

[Figure: CU pipeline — Wavefront Manager with RTM, Instr. Fetch from CRAM, WG Scheduler, Wavefront Scheduler (wavefronts 0–7), Register File ports A/B, ALU at clk_2x, PE0–PE7, Station, Buffer, Write Back, Compute Vector, Read Address FIFO, CU Memory Controller with R/W Addr & Data and Read Data paths]

FGPU | FPGA’16, Monterey, 23 February 2016 9
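The time-multiplexing of one instruction over 8 PEs and 8 clock cycles can be modelled as follows. A wavefront of 8 × 8 = 64 work-items follows from those two numbers, but the exact issue order (which work-item runs on which PE in which cycle) is an assumption of this sketch.

```python
# Sketch: inside one CU, a single instruction is issued once and then
# executed over 8 clock cycles on 8 PEs, covering one wavefront of
# 8 x 8 = 64 work-items (illustrative model of the time-multiplexing).

NUM_PES = 8
CYCLES_PER_INSTR = 8
WF_SIZE = NUM_PES * CYCLES_PER_INSTR  # 64 work-items per wavefront

def execute(instr, wavefront_regs):
    """Apply one instruction to all 64 work-items, cycle by cycle."""
    assert len(wavefront_regs) == WF_SIZE
    trace = []  # (cycle, pe, work_item) issue order
    for cycle in range(CYCLES_PER_INSTR):
        for pe in range(NUM_PES):
            wi = cycle * NUM_PES + pe        # assumed work-item mapping
            wavefront_regs[wi] = instr(wavefront_regs[wi])
            trace.append((cycle, pe, wi))
    return trace

regs = list(range(WF_SIZE))
trace = execute(lambda x: x + 1, regs)
print(len(trace), regs[0], regs[63])  # 64 1 64
```

One instruction fetch thus keeps the whole PE array busy for 8 cycles, which is what lets the deep 18-stage pipeline stay filled without per-work-item fetches.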



Page 42: FGPU: An SIMT-Architecture for FPGAs


FGPU Architecture: Global Memory Controller Architecture

[Figure: global memory controller — CU stations feeding the cache (ports A/B), Byte Dirty memory, Tag Controller with Tag Memory and Tag Manager, Cache Cleaner, Write Pipeline, and AXI Controller driving the AXI port signals ARADDR, RDATA, WDATA, WSTRB, AWADDR]

It processes up to 64 outstanding requests from multiple CUs

Requests that have not been served yet gain priority the longer they wait

A multi-bank, direct-mapped cache with a write-back strategy is included

Accesses to global memory are parallelized over multiple AXI4 ports

Data is transferred in bursts that fill a single cache line

FGPU | FPGA’16, Monterey, 23 February 2016 10
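A direct-mapped write-back cache of the kind described above can be sketched in a few lines of Python: writes only mark lines dirty, dirty lines are written back to global memory when they are evicted, and a miss fills the whole line in one burst. Line size, line count and the per-line (rather than per-byte) dirty granularity are simplifications of the Byte Dirty scheme shown in the figure.

```python
# Sketch of a direct-mapped, write-back cache: writes mark lines dirty;
# evictions write dirty lines back; misses fill a whole line in one
# burst. Line size and line count are illustrative.

LINE_BYTES = 8
NUM_LINES = 4

class DirectMappedCache:
    def __init__(self, memory):
        self.mem = memory                       # backing "global memory"
        self.tags = [None] * NUM_LINES
        self.lines = [bytearray(LINE_BYTES) for _ in range(NUM_LINES)]
        self.dirty = [False] * NUM_LINES
        self.writebacks = 0

    def _locate(self, addr):
        line = (addr // LINE_BYTES) % NUM_LINES
        tag = addr // (LINE_BYTES * NUM_LINES)
        if self.tags[line] != tag:              # miss: evict, then fill
            if self.dirty[line]:                # write-back on eviction
                base = (self.tags[line] * NUM_LINES + line) * LINE_BYTES
                self.mem[base:base + LINE_BYTES] = self.lines[line]
                self.writebacks += 1
            base = (tag * NUM_LINES + line) * LINE_BYTES
            self.lines[line][:] = self.mem[base:base + LINE_BYTES]  # burst fill
            self.tags[line] = tag
            self.dirty[line] = False
        return line, addr % LINE_BYTES

    def read(self, addr):
        line, off = self._locate(addr)
        return self.lines[line][off]

    def write(self, addr, value):
        line, off = self._locate(addr)
        self.lines[line][off] = value
        self.dirty[line] = True

mem = bytearray(256)
cache = DirectMappedCache(mem)
cache.write(0, 7)
print(cache.read(0), mem[0])    # 7 0  (dirty data not yet written back)
cache.read(32)                  # address 32 maps to the same line -> eviction
print(mem[0])                   # 7  (write-back happened on eviction)
```

The aliasing in the usage example (addresses 0 and 32 mapping to the same line) is exactly the case where a direct-mapped write-back cache must flush dirty data to global memory.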



Page 49: FGPU: An SIMT-Architecture for FPGAs


Results: System Setup

Development board: ZC706 (Zynq XC7Z045) from Xilinx

Two other solutions for comparison:

The hard ARM Cortex-A9

The soft MicroBlaze

Processing data located in the DDR module

The benchmark was compiled with -O3 as a bare-metal application

Only 32-bit integer operations have been considered

Cache flushing was always performed

[Figure: Zynq XC7Z045 — PS with two Cortex-A9 CPUs (each with a NEON engine), memory interface to DDR3 SDRAM, HP ports 0–3, AXI Slave HP Interconnect and Central Memory Interconnect; PL with FGPU (control interface, memory controller, compute units, cache) attached as AXI master]

FGPU | FPGA’16, Monterey, 23 February 2016 11


Page 56: FGPU: An SIMT-Architecture for FPGAs

RUHR-UNIVERSITÄT BOCHUM

FGPU Implementation: Highlights

Scalability

Up to 4K threads run simultaneously on 64 PEs

Slight degradation of the operating frequency for bigger designs

Portability: no IP-cores or primitives

Even DSP slices in pipeline mode were targeted without using any IP-cores

Flexibility

Many parameters can be configured to meet the best performance/area trade-off

Floorplan for 8 CUs on XC7Z045

FGPU | FPGA’16, Monterey, 23 February 2016 12


Page 60: FGPU: An SIMT-Architecture for FPGAs

RUHR-UNIVERSITÄT BOCHUM

Results: Area Requirements

Many FGPUs with different settings were tested

Clock frequencies were fixed at 200 MHz / 400 MHz

BRAMs and DSPs were targeted, but without the manufacturer's IP-cores

More FFs than LUTs were consumed due to the deep pipeline

Empty pipeline stages were inserted between the two clock domains to improve timing

Adjustable parameters with direct influence on design scalability:

Module             Parameter                    Range
CU                 # CUs                        2, 4, 8
                   # Outstanding mem. requests  16, 24, 32
Memory Controller  Cache size                   1 to 8 KB
                   # Cache read banks           2, 4, 8
                   # Outstanding mem. requests  32, 64
                   # AXI4 interfaces            1, 2, 4
                   # Tag managers               2, 4, 8, 16

Area requirements for different configurations:

             LUTs   FFs    BRAM   DSPs
Available    219K   437K   545    900
8 CUs        57%    36%    31%    21%
4 CUs        35%    20%    18%    11%
2 CUs        26%    14%    12%    5.3%
2 CUs (min)  9.6%   8.1%   8.5%   5.3%
MicroBlaze   3.2%   1.3%   5.0%   0.7%

FGPU | FPGA’16, Monterey, 23 February 2016 13


Page 66: FGPU: An SIMT-Architecture for FPGAs

RUHR-UNIVERSITÄT BOCHUM

Results: Speedup over MicroBlaze

The MicroBlaze is:

configured with default settings for best performance

clocked at 185 MHz

49x average speedup was achieved overall

166x maximum speedup for matrix multiplication

7.7 Gbps maximum throughput when used like a DMA (memcopy), averaged over the whole execution time

[Chart: wall clock time speedup for 8 CUs over the MicroBlaze implementation for task sizes between 256 and 256K; benchmarks: FIR (20 taps), matrix multiplication, cross correlation, vecadd & vecmul, memcopy, FIR (5 taps); geomean, max and min shown per benchmark]

FGPU | FPGA’16, Monterey, 23 February 2016 14


Page 72: FGPU: An SIMT-Architecture for FPGAs

RUHR-UNIVERSITÄT BOCHUM

Results: Speedup over ARM+NEON

The ARM CPU is:

supported by the NEON vector engine

clocked at 667 MHz

3.5x average speedup was achieved overall

35x maximum speedup for matrix multiplication

Better speedups were recorded for more computationally intensive tasks

Cache flushing on ARM was done only for the dirty region

[Chart: wall clock time speedup for 8 CUs over the ARM+NEON implementation for task sizes between 256 and 256K; benchmarks: FIR (20 taps), matrix multiplication, cross correlation, vecadd & vecmul, FIR (5 taps), transpose, memcopy; geomean, max and min shown per benchmark]

FGPU | FPGA’16, Monterey, 23 February 2016 15


Page 79: FGPU: An SIMT-Architecture for FPGAs

RUHR-UNIVERSITÄT BOCHUM

Results: Problem Size

Better speedups are achieved when processing bigger pieces of data

The speedup increases more rapidly for the most complex applications

A minimum of 4 µs is needed by the FGPU for:

preparing to execute a new task at the beginning

flushing the cache content at the end

[Chart: wall clock time speedup for 8 CUs over the ARM+NEON implementation for variable task size, 64 to 256K; curves for memcopy, vecmul & vecadd, FIR (5 taps) and cross correlation; speedup axis from 0 to 12]

FGPU | FPGA’16, Monterey, 23 February 2016 16
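The saturation trend above (speedup growing with task size under a fixed startup/flush cost of about 4 µs) can be sketched with a simple fixed-overhead model. Everything here except the existence of a fixed overhead is hypothetical: the per-item cost and raw speedup are placeholder parameters, not measured values:

```c
/* Effective speedup with a fixed per-kernel overhead:
 *   S(n) = t_cpu(n) / (t_fgpu(n) + overhead)
 * With t_cpu(n) = n * c and t_fgpu(n) = t_cpu(n) / S_raw, the
 * effective speedup approaches S_raw for large n and collapses
 * for tiny tasks, matching the trend in the plot above. */
double effective_speedup(double n, double cost_per_item_cpu,
                         double raw_speedup, double overhead)
{
    double t_cpu  = n * cost_per_item_cpu;
    double t_fgpu = t_cpu / raw_speedup + overhead;
    return t_cpu / t_fgpu;
}
```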


Page 85: FGPU: An SIMT-Architecture for FPGAs

RUHR-UNIVERSITÄT BOCHUM

Results: Number of CUs

Implementing more CUs always improved the speedups

linearly in best cases

even for less computationally intensive applications

Having multiple CUs improves the efficiency of the memory controller by pushing more memory requests into the pipeline

[Chart: wall clock time speedup for FGPU with 8, 4 and 2 CUs over the MicroBlaze implementation for a task size of 16K; curves for memcopy, vecadd & vecmul, FIR (5 taps), cross correlation and matrix multiplication; speedup axis from 0 to 180]

FGPU | FPGA’16, Monterey, 23 February 2016 17
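The point about more in-flight memory requests can be made concrete with Little's law: the bandwidth a pipelined memory system can sustain is roughly (bytes in flight) / (round-trip latency), capped by the DRAM peak, so more CUs issuing outstanding requests raise utilization until that cap is hit. This is a generic model, not the authors'; the line size, latency and peak bandwidth in the test are illustrative numbers only:

```c
/* Little's law sketch: achievable bandwidth (bytes/s) given the
 * number of outstanding cache-line requests, the line size and
 * the round-trip memory latency, capped by the DRAM peak. */
double achievable_bw(double outstanding, double line_bytes,
                     double latency_s, double peak_bw)
{
    double bw = outstanding * line_bytes / latency_s;
    return bw < peak_bw ? bw : peak_bw;
}
```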


Page 90: FGPU: An SIMT-Architecture for FPGAs

RUHR-UNIVERSITÄT BOCHUM

Results: Power Consumption

The measurements were done using the on-board power monitor

The second ARM core recorded the power consumption while the first controlled the FGPU

On average over the whole benchmark:

the biggest and smallest FGPUs consumed 13x and 3.3x more power than the MicroBlaze

a maximum of 5.2 W was recorded by the power monitor

[Chart: ratio of consumed power over MicroBlaze for the 8-, 4- and 2-CU and minimal 2-CU configurations, per benchmark (memcopy, FIR (5 taps), FIR (20 taps), matrix multiply, cross correlation, vecmul & vecadd, transpose); estimated average after P&R vs. measured average; ratio axis from 0 to 16]

FGPU | FPGA’16, Monterey, 23 February 2016 18


Page 95: FGPU: An SIMT-Architecture for FPGAs

RUHR-UNIVERSITÄT BOCHUM

ResultsSummary

In comparison to the MicroBlaze, alltested FGPUs:

consume 3.2x - 4.4x less energy

speedup execution by 11x - 49x

need 3.0x - 17.7x more area

In comparison to ARM+NEON, all testedFGPUs:

speedup execution by up to 3.5x

                        8 CUs   4 CUs   2 CUs   2 CUs (Min)
Speedup over MB          48.5    31.9    19.9    10.6
Power saving over MB      3.7     4.3     4.4     3.2
#LUTs over MB            17.7    10.9     8.1     3.0
Speedup over ARM+NEON     3.5     2.3     1.4     1.0

Average wall clock time speedup, power saving and area overhead for different FGPUs over MicroBlaze and ARM+NEON implementations. Speedups were averaged over the whole benchmark and problem sizes from 256 to 256K.

FGPU | FPGA’16, Monterey, 23 February 2016 19


Page 100: FGPU: An SIMT-Architecture for FPGAs

RUHR-UNIVERSITÄT BOCHUM

Outline

1 Motivation and Background

2 FGPU Architecture
Execution Model
Platform Model
Compute Unit Architecture
Global Memory Controller

3 Implementation and Results

4 Future Work and Conclusion

FGPU | FPGA’16, Monterey, 23 February 2016 19

Page 101: FGPU: An SIMT-Architecture for FPGAs

RUHR-UNIVERSITÄT BOCHUM

Future Work

Extending the ISA to cover more benchmarks

Developing an LLVM backend to compile OpenCL kernels

Developing a Linux driver with an OpenCL-compatible interface

Enabling branches at the work-item level

Supporting soft/hard floating-point computations

Providing local/global atomic operations

Improving the cache system by adding a second level

Implementing local storage within the CUs to comply with the OpenCL memory model

FGPU | FPGA’16, Monterey, 23 February 2016 20


Page 110: FGPU: An SIMT-Architecture for FPGAs

Thank you for your attention!

Page 111: FGPU: An SIMT-Architecture for FPGAs

RUHR-UNIVERSITÄT BOCHUM

Instruction Set Architecture: FIR Filter Example

__kernel void fir(int *input_array, int *filter, int *res_array, int filter_len)
{
    int index = get_global_id(0);
    int i = 0;
    int acc = 0;
    do {
        acc += input_array[index + i] * filter[i];
        i++;
    } while (i != filter_len);
    res_array[index] = acc;
}

# FIR filter using a 1D index space. It has 4 parameters:
# 0 : address of the first element in the input array
# 1 : address of the first element in the coefficients array
# 2 : address of the first element in the results array
# 3 : filter length (L)
LID   r1, d0        # Local ID: load the local work-item index within its work-group into r1
WGOFF r2, d0        # Work-Group Offset: load the work-group global offset into r2
ADD   r1, r1, r2    # ADD integers: r1 now holds the global ID of the work-item
LP    r2, 3         # Load Parameter: r2 holds the filter length
LP    r3, 0         # Load Parameter: r3 is a pointer to the input array
LP    r4, 1         # Load Parameter: r4 is a pointer to the coefficients array
ADDI  r5, r0, 0     # ADD Immediate: r5 is the loop index (initialized to 0)
ADDI  r6, r0, 0     # ADD Immediate: r6 will hold the result (initialized to 0)

begin:
LW    r10, r4[r5]   # Load Word: load a coefficient into r10
ADD   r11, r5, r1   # ADD integers: compute the index of an element in the input array
LW    r11, r3[r11]  # Load Word: load the input element into r11
MACC  r6, r10, r11  # Multiply and ACCumulate: update the result
ADDI  r5, r5, 1     # ADD Immediate: update the loop index
BNE   r5, r2, begin # Branch if Not Equal: repeat the iteration if necessary

LP    r20, 2        # Load Parameter: r20 is a pointer to the result array
SW    r6, r20[r1]   # Store Word: store the result r6 at index r1 in the result array
RET                 # RETurn: end of task

(a) FIR filter as OpenCL kernel (b) Equivalent implementation in FGPU ISA

FGPU | FPGA’16, Monterey, 23 February 2016 22