Parallelism and Customization: Trends in Computer Architectures for
Accelerating Scientific Computing
Jun.-Prof. Dr. Christian Plessl
2nd International Symposium "Computer Simulations on GPU" (SimGPU 2013)
Freudenstadt, Germany
Custom Computing Group
University of Paderborn
2013-05-28
Motivation: Increasing Demand for Computing
• computer simulation has been established as a standard method in many areas of science and engineering
  – e.g. computational fluid dynamics, structural simulation, propagation of electromagnetic fields, ...
  – demand for finer temporal and spatial resolution or more complex models
  – in many cases compute bound
• new areas with different characteristics are being addressed
  – e.g. bioinformatics, large-scale molecular dynamics, systems biology, ...
  – compute bound, but increasingly also memory bound (big data)
• availability of high performance computing resources is a competitive advantage or necessity in many areas
Simple CPU Performance Model (1)
• performance measured as execution time for given program
t_execution = #instructions · (cycles / instruction) · (time / cycle)
how to increase performance? reduce #instructions by increasing work done per instruction
– more complex instructions: a = a + b · c becomes a single multiply-accumulate instruction, mac $a $b $c
– sub-word parallelism / vector instructions: c[3..0] = a[3..0] + b[3..0]

figure: four adder lanes compute a3+b3, a2+b2, a1+b1, a0+b0 in parallel within one instruction
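To make the sub-word idea concrete, here is a minimal C sketch using SSE2 intrinsics (an illustration added here, not part of the original slides; array and function names are mine):

#include <emmintrin.h>  /* SSE2 intrinsics */

/* illustrative sketch: four 32-bit additions in one vector instruction,
   the software form of c[3..0] = a[3..0] + b[3..0] */
void add4(const int a[4], const int b[4], int c[4]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a); /* load a[3..0] */
    __m128i vb = _mm_loadu_si128((const __m128i *)b); /* load b[3..0] */
    __m128i vc = _mm_add_epi32(va, vb);               /* 4 parallel adds */
    _mm_storeu_si128((__m128i *)c, vc);               /* store c[3..0] */
}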
Simple CPU Performance Model (2)
• performance measured as execution time for given program
t_execution = #instructions · (cycles / instruction) · (time / cycle)
how to increase performance? reduce #cycles/instruction (improve throughput)
– overlapped execution of instructions (pipelining): begin executing dependent instructions before all their dependencies are resolved, e.g. a = b + c; d = a + e
– parallel execution of instructions (multiple issue / superscalar): concurrently start independent instructions, e.g. a = b + c; d = e + f
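A small C illustration of the throughput argument (added here, not from the slides): both loops perform the same additions, but the second exposes two independent dependency chains that a pipelined, multiple-issue core can overlap:

/* one accumulator: each add depends on the previous one, so the loop
   runs at the latency of the add chain */
double sum_serial(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* two independent accumulators: adjacent adds have no dependency and can
   be issued concurrently (multiple issue) or overlapped (pipelining) */
double sum_two_chains(const double *a, int n) {
    double s0 = 0.0, s1 = 0.0;
    for (int i = 0; i + 1 < n; i += 2) {
        s0 += a[i];     /* chain 0 */
        s1 += a[i + 1]; /* chain 1 */
    }
    if (n % 2)
        s0 += a[n - 1];
    return s0 + s1;
}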
Simple CPU Performance Model (3)
• performance measured as execution time for given program
t_execution = #instructions · (cycles / instruction) · (time / cycle)
how to increase performance? execute cycles in shorter time
use leading semiconductor technology
e.g. FinFET transistors (image: Intel)
Riding the Waves of Moore's Law
• despite all computer architecture innovations, Moore's law contributed most to the performance increase
  – more usable chip area
  – faster clock speed
  – exponential growth of performance
• but since the early 2000s hardly any increase in single-core performance due to
  – power dissipation
  – design complexity

figure: CPU performance trends over time (source: Herb Sutter, Microsoft)
Limits of Single-Core CPU Performance Growth
• CPUs are not tailored to a particular task
  – unknown applications
  – unknown program flow
  – unknown memory access patterns
  – unknown parallelism
• generality causes inefficiencies
  – poor ratio of actively computing chip area to total chip area
  – excessive power consumption
  – difficult design process

figure: die shot of a 4-core AMD Barcelona CPU, highlighting the chip area that contributes to actual computation (image: Anandtech)
Scaling Up Performance Through Parallelism
• exploit parallelism on many levels to increase performance
  – core level: vector instructions, multiple issue, out-of-order execution
  – chip level: several CPU cores in a single processor
  – server level: several CPUs per server
  – data center level: many servers with a fast interconnect
• promise
  – improve performance by spreading work across many processors
  – improve energy efficiency by using processors optimized for efficiency instead of single-core performance
figure: number of CPU cores in supercomputers (log scale, 1 to ~1,000,000), June 1993 to June 2012, showing TOP500 rank 1, TOP500 rank 100, and trend lines for both
Challenges for Exploiting Parallelism
• creating programs that scale efficiently is challenging
  – speedup fundamentally limited by the non-parallelizable parts (Amdahl's law)
  – communication and synchronization incur overheads
  – poor tool support (analysis, parallelization, debugging)
S_max(ser, n_CPU) = 1 / (ser + (1 − ser) / n_CPU)
figure: maximum achievable speedup (Amdahl's law) over 1 to 16384 CPUs for applications with 20%, 10%, 5%, 1%, and 0.5% serial code
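The bound is easy to evaluate numerically; a tiny C helper (my illustration; the function name is an assumption):

#include <stdio.h>

/* maximum achievable speedup per Amdahl's law: serial fraction `ser`,
   `ncpu` processors */
double amdahl_smax(double ser, double ncpu) {
    return 1.0 / (ser + (1.0 - ser) / ncpu);
}

int main(void) {
    /* even 5% serial code caps the speedup near 20x, regardless of nCPU */
    printf("%.1f\n", amdahl_smax(0.05, 1024.0));  /* prints 19.6 */
    printf("%.1f\n", amdahl_smax(0.05, 16384.0)); /* prints 20.0 */
    return 0;
}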
Scaling up Performance Through Customization
figure: three levels of customization
  – general-purpose programmable CPU architecture: no customization
  – implementation on a customized programmable CPU architecture: customized instruction set, number of execution units, and I/O (memory, interconnect)
  – custom computing engine implemented in programmable hardware: complete customization of the processing architecture
Example for a Custom Computing Engine
• bioinformatics: substring search in genome sequences
figure (animation, six snapshots): a matching pipeline for the query string 'ATCA' processes a stream of DNA sequence data (... G A T C A ...) one character per cycle; each stage compares the incoming character against one pattern character ('A', 'T', 'C', 'A') and forwards its match signal (match 'A', match 'T', match 'C', match 'A') through AND gates, so that match 'ATCA' switches from "no" to "yes" once four consecutive stream characters equal the pattern
the example illustrates the customization opportunities:
  – application-specific interconnect
  – data stream computing and pipelining
  – custom operations
  – parallel operations
  – wide data paths
  – custom data types
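A software analogue of this matching pipeline, as a C sketch (added for illustration; in hardware all stages update in parallel, one stream character per cycle):

#include <stdbool.h>
#include <stdio.h>

#define PLEN 4
static const char pattern[PLEN] = {'A', 'T', 'C', 'A'};

/* one pipeline step: stage[k] remembers whether the last k+1 characters
   matched pattern[0..k]; the last stage signals a full 'ATCA' match */
static bool match_step(bool stage[PLEN], char in) {
    for (int k = PLEN - 1; k > 0; k--)
        stage[k] = stage[k - 1] && (in == pattern[k]);
    stage[0] = (in == pattern[0]);
    return stage[PLEN - 1];
}

int main(void) {
    const char *dna = "GATCAA"; /* example stream from the figure */
    bool stage[PLEN] = {false, false, false, false};
    for (int i = 0; dna[i] != '\0'; i++)
        if (match_step(stage, dna[i]))
            printf("match 'ATCA' ending at position %d\n", i);
    return 0;
}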
Custom Computing Technology
• reconfigurable hardware architectures
  – software-programmable processing blocks and
  – software-programmable interconnect
  – massively parallel
figure: two flavors of reconfigurable fabric
  – FPGA, fine grained (bit oriented): lookup tables (LUTs), flip-flops (FFs), DSP operation blocks, and on-chip SRAM
  – coarse-grained reconfigurable array (word oriented): array of ALUs with registers (REGs) and word-level interconnect
Success Stories and Challenges
• high speedups have been demonstrated for many domains, e.g.
  – computational finance, option pricing:
    5x performance and 26x energy efficiency over CPU [1]
    3x performance and 25x energy efficiency over GPU [1]
  – sparse matrix CG solver: 20-40x performance over CPU [2]
  – 3D convolution: 70x performance over CPU, 14x over GPU [2]
  – molecular dynamics: 80x performance over NAMD on a single-core CPU [3]
• challenges
  – creating highly efficient custom computing engines requires deep hardware and application knowledge
  – high-level design processes are still a subject of research
[1] A. H. T. Tse, D. Thomas, and W. Luk. Design exploration of quadrature methods in option pricing. IEEE Trans. on Very Large Scale Integration (VLSI) Systems, 20:818–826, May 2011.
[2] O. Lindtjorn, R. G. Clapp, O. Pell, O. Mencer, M. J. Flynn, and H. Fu. Beyond traditional microprocessors for geoscience high-performance computing applications. IEEE Micro, 31(2):41–49, Mar.–Apr. 2011.
[3] M. Chiu and M. C. Herbordt. Molecular dynamics simulations on high-performance reconfigurable computing systems. ACM Trans. on Reconfigurable Technology and Systems, 3:23:1–23:37, Nov. 2010.
Current Parallel Computing Architectures
                  | multi-core CPU | many-core CPU | graphics processing units (GPU) | field programmable gate arrays (FPGA)
cores             | ~10 | ~100 | ~1000 | ~100'000
core complexity   | complex (opt. for single-thread performance) | simple | simple | simple
computation model | MIMD + SIMD | MIMD + SIMD | SIMD | data-flow
parallelism       | thread and data parallel | thread and data parallel | data parallel | arbitrary
memory model      | shared | shared | distributed | distributed
power             | 150W | 200W | 250W | 50W
Case Study: Computational Nanophotonics
• test case: microdisk cavity in a perfect metallic environment
  – well-studied nanophotonic device
  – point-like time-dependent source (optical dipole)
  – known analytic solution (whispering gallery modes)
• target system
  – Maxeler MPC-C system, 2x Xeon X5650 (6 cores, 2.7 GHz)
  – 4 MAX3 FPGA cards (Xilinx Virtex-6 SX475T), 48 GB SDRAM
figure: experimental setup of the microdisk cavity (vacuum disk enclosed by perfect metal, with the source inside) and the resulting energy density distribution
Maxeler Data Flow Computing (1)
• integrated custom computing platform for HPC
  – tightly integrated hardware and software
  – high-level Java-based specification
  – FPGA internals and tools hidden from the developer
  – suitable for streaming applications

example: moving-average kernel b[i] = (a[i-1] + a[i] + a[i+1]) / 3

figure: data flow graph of the kernel (input stream A, offsets -1/+1, two adders, division by 3, output stream B)

public class Mav_kernel extends Kernel {
    public Mav_kernel(KernelParameters parameters) {
        super(parameters);
        // read input stream A as 32-bit floats (8 exponent bits, 24 mantissa bits)
        HWVar x = io.input("A", hwFloat(8, 24));
        // access the neighboring stream elements via relative offsets
        HWVar prev = stream.offset(x, -1);
        HWVar next = stream.offset(x, 1);
        HWVar sum = prev + x + next;
        HWVar result = sum / 3;
        // write the averaged value to output stream B
        io.output("B", result, hwFloat(8, 24));
    }
}
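For reference, what the kernel computes, written as a plain C loop (a sketch; the first and last elements are skipped because the ±1 offsets fall outside the stream there):

/* software equivalent of the moving-average dataflow kernel above */
void moving_average(const float *a, float *b, int n) {
    for (int i = 1; i < n - 1; i++)
        b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0f;
}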
Maxeler Data Flow Computing (2)
• architecture and design flow
figure: Maxeler design flow and MPC-C platform
  – design flow: the original application (.c, .f) is split in MaxIDE; the computationally intensive components are expressed as application kernels (.maxj) plus a manager configuration (.maxj) and compiled by MaxCompiler (hardware build or simulation) into a hardware accelerator (.max), while the remaining code is compiled and linked for the x86 CPUs
  – platform: x86 CPUs with DRAM attached via PCI Express to data flow engines (DFEs), each DFE with its own memory controller and DRAM; the DFEs are interconnected by MaxRing
Finite Difference Time Domain Method (FDTD)
• numerical method for solving Maxwell's equations
• iterative algorithm: computes the propagation of the fields for a fixed time step
• stencil computation on a regular grid
  – same operations for each grid point
  – fixed local data access pattern
  – simple arithmetic
• difficult to achieve high performance
  – hardly any data reuse
  – few operations per data element
FDTD update equations (for one time step):

Ex'[x,y] = ca * Ex[x,y] + cb * (Hz[x,y] - Hz[x,y-1]);
Ey'[x,y] = ca * Ey[x,y] + cb * (Hz[x-1,y] - Hz[x,y]);
Hz'[x,y] = da * Hz[x,y] + db * (Ex[x,y+1] - Ex[x,y] + Ey'[x,y] - Ey'[x+1,y]);
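One way to realize the primed (updated) fields in plain C is two sweeps per time step: update E in place, then H from the freshly updated E. A sketch under these assumptions (array and coefficient names as above; boundary cells skipped):

/* sketch of one 2D FDTD time step; assumes C99 variable-length arrays */
void fdtd_step(int nx, int ny, double Ex[nx][ny], double Ey[nx][ny],
               double Hz[nx][ny], double ca, double cb,
               double da, double db) {
    /* sweep 1: electric field from the magnetic field */
    for (int x = 1; x < nx; x++)
        for (int y = 1; y < ny; y++) {
            Ex[x][y] = ca * Ex[x][y] + cb * (Hz[x][y] - Hz[x][y - 1]);
            Ey[x][y] = ca * Ey[x][y] + cb * (Hz[x - 1][y] - Hz[x][y]);
        }
    /* sweep 2: magnetic field from the updated electric field */
    for (int x = 0; x < nx - 1; x++)
        for (int y = 0; y < ny - 1; y++)
            Hz[x][y] = da * Hz[x][y]
                     + db * (Ex[x][y + 1] - Ex[x][y]
                           + Ey[x][y] - Ey[x + 1][y]);
}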
Translating FDTD to Stream Computing
2D arrays are streamed as a 1D stream; [x,y] accesses translate to constant stream offsets

figure: 5-point stencil (center with north, west, east, and south neighbors) on a grid with row length size; the index mapping is
  – center Ex[x,y]   → ex[i]
  – north  Ex[x-1,y] → ex[i-size]
  – west   Ex[x,y-1] → ex[i-1]
  – east   Ex[x,y+1] → ex[i+1]
  – south  Ex[x+1,y] → ex[i+size]

FDTD pseudo code:

foreach (x,y) {
    Ex'[x,y] = ca * Ex[x,y] + cb * (Hz[x,y] - Hz[x,y-1]);
    Ey'[x,y] = ca * Ey[x,y] + cb * (Hz[x-1,y] - Hz[x,y]);
    Hz'[x,y] = da * Hz[x,y] + db * (Ex[x,y+1] - Ex[x,y] + Ey'[x,y] - Ey'[x+1,y]);
}

excerpt from the MaxJ code for creating a Maxeler data flow engine:

HWVar ex = io.input("iEx", hwFloat(11, 53));
HWVar ey = io.input("iEy", hwFloat(11, 53));

HWVar hz_west  = stream.offset(hz, -1);
HWVar hz_north = stream.offset(hz, -size);

exnext = (ca * ex) + (cb * (hz - hz_west));
eynext = (ca * ey) + (cb * (hz_north - hz));

io.output("oEx", exresult, hwFloat(11, 53));
io.output("oEy", eyresult, hwFloat(11, 53));
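The same stream view spelled out in C (my sketch; assumes the linear index i = x*size + y over n = size*size grid points): the 2D sweep becomes a single pass in which every stencil neighbor is a constant offset, which is exactly what stream.offset expresses above:

/* sketch: E-field update pass over the linearized 1D stream */
void update_e_stream(int n, int size, const double *hz,
                     const double *ex, const double *ey,
                     double *ex_next, double *ey_next,
                     double ca, double cb) {
    for (int i = size; i < n - size; i++) {
        double hz_west  = hz[i - 1];    /* Hz[x][y-1] */
        double hz_north = hz[i - size]; /* Hz[x-1][y] */
        ex_next[i] = ca * ex[i] + cb * (hz[i] - hz_west);
        ey_next[i] = ca * ey[i] + cb * (hz_north - hz[i]);
    }
}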
Dataflow Engine for 2D FDTD simulation
figure: data flow graph of the 2D FDTD engine; for each stream element i it
  – updates Ex[i], Ey[i], and Hz[i]
  – checks whether the currently updated point i is the source or inside the boundary
  – chooses between the computed update and the boundary value
  – computes the result (energy density)
Performance Evaluation 2D FDTD
• implementation
  – uses a single DFE card (FPGA utilization ~80%)
  – data flow engine replicated in 15 pipeline stages
  – double-precision floating-point accuracy
• results
  – CPU performance breaks down at 2^18 grid points (working set size > cache size)
  – for large problems the FPGA achieves a 2.4x speedup over the parallel CPU implementation
figure: throughput (Mcells/s, 0 to 1800) over the number of grid points (2^14 to 2^24) for "Maxeler 2D, 1 DFE, 15 Pipeline Stages", "OMP 8 Threads", and "OMP 8 Threads, optimized Blocking"
Performance Evaluation 3D FDTD
• implementation
  – uses the MaxGenFD finite-difference toolkit for a simplified specification
  – single-precision floating-point accuracy
  – uses 4 DFE cards in parallel
• results
  – 91x faster than the single-core version
  – 7.5x faster than the optimized multi-core version
  – outperforms the fastest GPU solvers by 2x per device (at about 1/3 of the power consumption)
  – currently the fastest published FPGA-accelerated FDTD solver
figure: throughput (Mcells/s, 0 to 1400) over the number of grid points (2^12 to 2^30) for "MaxGenFD 3D, 2 DFEs, 4 Pipelines", "MaxGenFD 3D, 4 DFEs, 4 Pipelines", "SingleCore", "OMP-8Threads", and "OMP-24Threads"
• more details in: H. Giefers, C. Plessl, and J. Förstner. Accelerating finite difference time domain simulations with reconfigurable dataflow computers. In Proc. Int. Workshop on Highly Efficient Accelerators and Reconfigurable Technologies (HEART), June 2013. Accepted for publication.
Conclusions and Outlook
• technological and economic reasons have caused single-core CPU performance and efficiency to stagnate
• parallelism and customization are two trends to address the desire for more performance
• opportunities and challenges for the computational sciences
  – unprecedented levels of performance become affordable
  – but programming has become more difficult
  – the times where computer architecture could be treated as a black box are gone (at least for the moment)
• many research opportunities in making these new architectures easier to use for non-computer-scientists
Questions and Contact
• Questions?
• Contact information
Jun.-Prof. Dr. Christian Plessl
[email protected]
University of Paderborn
Department of Computer Science