Parallelism and Customization: Trends in Computer Architectures for
Accelerating Scientific Computing
Jun.-Prof. Dr. Christian Plessl
2nd International Symposium "Computer Simulations on GPU" (SimGPU 2013)
Freudenstadt, Germany
Custom Computing Group
University of Paderborn
2013-05-28
Motivation: Increasing Demand for Computing
• computer simulation has been established as a standard method in many areas of science and engineering
  – e.g. computational fluid dynamics, structural simulation, propagation of electromagnetic fields, ...
  – demand for finer temporal and spatial resolution or more complex models
  – in many cases compute bound
• new areas with different characteristics are being addressed
  – e.g. bioinformatics, large-scale molecular dynamics, systems biology, ...
  – compute bound, but increasingly also memory bound (big data)
• availability of high performance computing resources is a competitive advantage or necessity in many areas
Simple CPU Performance Model (1)
• performance measured as execution time for given program
t_execution = #instructions · (cycles / instruction) · (time / cycle)
how to increase performance? reduce #instructions by increasing work done per instruction
– more complex instructions: a = a + b · c becomes a single multiply-accumulate instruction, mac $a $b $c
– sub-word parallelism / vector instructions: c[3..0] = a[3..0] + b[3..0]

figure: four adder lanes compute a3+b3, a2+b2, a1+b1, a0+b0 in parallel within one instruction
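To make the sub-word idea concrete, here is a minimal C sketch using SSE2 intrinsics (an illustration added here, not part of the original slides; array and function names are mine):

#include <emmintrin.h>  /* SSE2 intrinsics */

/* illustrative sketch: four 32-bit additions in one vector instruction,
   the software form of c[3..0] = a[3..0] + b[3..0] */
void add4(const int a[4], const int b[4], int c[4]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a); /* load a[3..0] */
    __m128i vb = _mm_loadu_si128((const __m128i *)b); /* load b[3..0] */
    __m128i vc = _mm_add_epi32(va, vb);               /* 4 parallel adds */
    _mm_storeu_si128((__m128i *)c, vc);               /* store c[3..0] */
}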
Simple CPU Performance Model (2)
• performance measured as execution time for given program
t_execution = #instructions · (cycles / instruction) · (time / cycle)
how to increase performance? reduce #cycles/instruction (improve throughput)
– overlapped execution of instructions (pipelining): begin executing dependent instructions before all their dependencies are resolved, e.g. a = b + c; d = a + e
– parallel execution of instructions (multiple issue / superscalar): concurrently start independent instructions, e.g. a = b + c; d = e + f
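A small C illustration of the throughput argument (added here, not from the slides): both loops perform the same additions, but the second exposes two independent dependency chains that a pipelined, multiple-issue core can overlap:

/* one accumulator: each add depends on the previous one, so the loop
   runs at the latency of the add chain */
double sum_serial(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* two independent accumulators: adjacent adds have no dependency and can
   be issued concurrently (multiple issue) or overlapped (pipelining) */
double sum_two_chains(const double *a, int n) {
    double s0 = 0.0, s1 = 0.0;
    for (int i = 0; i + 1 < n; i += 2) {
        s0 += a[i];     /* chain 0 */
        s1 += a[i + 1]; /* chain 1 */
    }
    if (n % 2)
        s0 += a[n - 1];
    return s0 + s1;
}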
Simple CPU Performance Model (3)
• performance measured as execution time for given program
t_execution = #instructions · (cycles / instruction) · (time / cycle)
how to increase performance? execute cycles in shorter time
use leading semiconductor technology
e.g. FinFET transistors (image: Intel)
Riding the Waves of Moore's Law
• despite all computer architecture innovations, Moore's law contributed most to the performance increase
  – more usable chip area
  – faster clock speed
  – exponential growth of performance
• but since the early 2000s hardly any increase in single-core performance due to
  – power dissipation
  – design complexity

figure: CPU performance trends over time (source: Herb Sutter, Microsoft)
Limits of Single-Core CPU Performance Growth
• CPUs are not tailored to a particular task
  – unknown applications
  – unknown program flow
  – unknown memory access patterns
  – unknown parallelism
• generality causes inefficiencies
  – poor ratio of actively computing chip area to total chip area
  – excessive power consumption
  – difficult design process

figure: die shot of a 4-core AMD Barcelona CPU, highlighting the chip area that contributes to actual computation (image: Anandtech)
Scaling Up Performance Through Parallelism
• exploit parallelism on many levels to increase performance
  – core level: vector instructions, multiple issue, out-of-order execution
  – chip level: several CPU cores in a single processor
  – server level: several CPUs per server
  – data center level: many servers with a fast interconnect
• promise
  – improve performance by spreading work across many processors
  – improve energy efficiency by using processors optimized for efficiency instead of single-core performance
figure: number of CPU cores in supercomputers (log scale, 1 to ~1,000,000), June 1993 to June 2012, showing TOP500 rank 1, TOP500 rank 100, and trend lines for both
Challenges for Exploiting Parallelism
• creating programs that scale efficiently is challenging
  – speedup fundamentally limited by the non-parallelizable parts (Amdahl's law)
  – communication and synchronization incur overheads
  – poor tool support (analysis, parallelization, debugging)
S_max(ser, n_CPU) = 1 / (ser + (1 − ser) / n_CPU)
figure: maximum achievable speedup (Amdahl's law) over 1 to 16384 CPUs for applications with 20%, 10%, 5%, 1%, and 0.5% serial code
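The bound is easy to evaluate numerically; a tiny C helper (my illustration; the function name is an assumption):

#include <stdio.h>

/* maximum achievable speedup per Amdahl's law: serial fraction `ser`,
   `ncpu` processors */
double amdahl_smax(double ser, double ncpu) {
    return 1.0 / (ser + (1.0 - ser) / ncpu);
}

int main(void) {
    /* even 5% serial code caps the speedup near 20x, regardless of nCPU */
    printf("%.1f\n", amdahl_smax(0.05, 1024.0));  /* prints 19.6 */
    printf("%.1f\n", amdahl_smax(0.05, 16384.0)); /* prints 20.0 */
    return 0;
}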
Scaling up Performance Through Customization
figure: three levels of customization
  – general-purpose programmable CPU architecture: no customization
  – implementation on a customized programmable CPU architecture: customized instruction set, number of execution units, and I/O (memory, interconnect)
  – custom computing engine implemented in programmable hardware: complete customization of the processing architecture
Example for a Custom Computing Engine
• bioinformatics: substring search in genome sequences
figure (animation, six snapshots): a matching pipeline for the query string 'ATCA' processes a stream of DNA sequence data (... G A T C A ...) one character per cycle; each stage compares the incoming character against one pattern character ('A', 'T', 'C', 'A') and forwards its match signal (match 'A', match 'T', match 'C', match 'A') through AND gates, so that match 'ATCA' switches from "no" to "yes" once four consecutive stream characters equal the pattern
the example illustrates the customization opportunities:
  – application-specific interconnect
  – data stream computing and pipelining
  – custom operations
  – parallel operations
  – wide data paths
  – custom data types
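A software analogue of this matching pipeline, as a C sketch (added for illustration; in hardware all stages update in parallel, one stream character per cycle):

#include <stdbool.h>
#include <stdio.h>

#define PLEN 4
static const char pattern[PLEN] = {'A', 'T', 'C', 'A'};

/* one pipeline step: stage[k] remembers whether the last k+1 characters
   matched pattern[0..k]; the last stage signals a full 'ATCA' match */
static bool match_step(bool stage[PLEN], char in) {
    for (int k = PLEN - 1; k > 0; k--)
        stage[k] = stage[k - 1] && (in == pattern[k]);
    stage[0] = (in == pattern[0]);
    return stage[PLEN - 1];
}

int main(void) {
    const char *dna = "GATCAA"; /* example stream from the figure */
    bool stage[PLEN] = {false, false, false, false};
    for (int i = 0; dna[i] != '\0'; i++)
        if (match_step(stage, dna[i]))
            printf("match 'ATCA' ending at position %d\n", i);
    return 0;
}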
Custom Computing Technology
• reconfigurable hardware architectures
  – software-programmable processing blocks and
  – software-programmable interconnect
  – massively parallel
figure: two flavors of reconfigurable fabric
  – FPGA, fine grained (bit oriented): lookup tables (LUTs), flip-flops (FFs), DSP operation blocks, and on-chip SRAM
  – coarse-grained reconfigurable array (word oriented): array of ALUs with registers (REGs) and word-level interconnect
Success Stories and Challenges
• high speedups have been demonstrated for many domains, e.g.
  – computational finance, option pricing:
    5x performance and 26x energy efficiency over CPU [1]
    3x performance and 25x energy efficiency over GPU [1]
  – sparse matrix CG solver: 20-40x performance over CPU [2]
  – 3D convolution: 70x performance over CPU, 14x over GPU [2]
  – molecular dynamics: 80x performance over NAMD on a single-core CPU [3]
• challenges
  – creating highly efficient custom computing engines requires deep hardware and application knowledge
  – high-level design processes are still a subject of research
[1] A. H. T. Tse, D. Thomas, and W. Luk. Design exploration of quadrature methods in option pricing. IEEE Trans. on Very Large Scale Integration (VLSI) Systems, 20:818–826, May 2011.
[2] O. Lindtjorn, R. G. Clapp, O. Pell, O. Mencer, M. J. Flynn, and H. Fu. Beyond traditional microprocessors for geoscience high-performance computing applications. IEEE Micro, 31(2):41–49, Mar.–Apr. 2011.
[3] M. Chiu and M. C. Herbordt. Molecular dynamics simulations on high-performance reconfigurable computing systems. ACM Trans. on Reconfigurable Technology and Systems, 3:23:1–23:37, Nov. 2010.
Current Parallel Computing Architectures
                  | multi-core CPU | many-core CPU | graphics processing units (GPU) | field programmable gate arrays (FPGA)
cores             | ~10 | ~100 | ~1000 | ~100'000
core complexity   | complex (opt. for single-thread performance) | simple | simple | simple
computation model | MIMD + SIMD | MIMD + SIMD | SIMD | data-flow
parallelism       | thread and data parallel | thread and data parallel | data parallel | arbitrary
memory model      | shared | shared | distributed | distributed
power             | 150W | 200W | 250W | 50W
Case Study: Computational Nanophotonics
• test case: microdisk cavity in a perfect metallic environment
  – well-studied nanophotonic device
  – point-like time-dependent source (optical dipole)
  – known analytic solution (whispering gallery modes)
• target system
  – Maxeler MPC-C system, 2x Xeon X5650 (6 cores, 2.7 GHz)
  – 4 MAX3 FPGA cards (Xilinx Virtex-6 SX475T), 48 GB SDRAM
figure: experimental setup of the microdisk cavity (vacuum disk enclosed by perfect metal, with the source inside) and the resulting energy density distribution
Maxeler Data Flow Computing (1)
• integrated custom computing platform for HPC
  – tightly integrated hardware and software
  – high-level Java-based specification
  – FPGA internals and tools hidden from the developer
  – suitable for streaming applications

example: moving-average kernel b[i] = (a[i-1] + a[i] + a[i+1]) / 3

figure: data flow graph of the kernel (input stream A, offsets -1/+1, two adders, division by 3, output stream B)

public class Mav_kernel extends Kernel {
    public Mav_kernel(KernelParameters parameters) {
        super(parameters);
        // read input stream A as 32-bit floats (8 exponent bits, 24 mantissa bits)
        HWVar x = io.input("A", hwFloat(8, 24));
        // access the neighboring stream elements via relative offsets
        HWVar prev = stream.offset(x, -1);
        HWVar next = stream.offset(x, 1);
        HWVar sum = prev + x + next;
        HWVar result = sum / 3;
        // write the averaged value to output stream B
        io.output("B", result, hwFloat(8, 24));
    }
}
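For reference, what the kernel computes, written as a plain C loop (a sketch; the first and last elements are skipped because the ±1 offsets fall outside the stream there):

/* software equivalent of the moving-average dataflow kernel above */
void moving_average(const float *a, float *b, int n) {
    for (int i = 1; i < n - 1; i++)
        b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0f;
}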
Maxeler Data Flow Computing (2)
• architecture and design flow
figure: Maxeler design flow and MPC-C platform
  – design flow: the original application (.c, .f) is split in MaxIDE; the computationally intensive components are expressed as application kernels (.maxj) plus a manager configuration (.maxj) and compiled by MaxCompiler (hardware build or simulation) into a hardware accelerator (.max), while the remaining code is compiled and linked for the x86 CPUs
  – platform: x86 CPUs with DRAM attached via PCI Express to data flow engines (DFEs), each DFE with its own memory controller and DRAM; the DFEs are interconnected by MaxRing
Finite Difference Time Domain Method (FDTD)
• numerical method for solving Maxwell's equations
• iterative algorithm: computes the propagation of the fields for a fixed time step
• stencil computation on a regular grid
  – same operations for each grid point
  – fixed local data access pattern
  – simple arithmetic
• difficult to achieve high performance
  – hardly any data reuse
  – few operations per data element
FDTD update equations (for one time step):

Ex'[x,y] = ca * Ex[x,y] + cb * (Hz[x,y] - Hz[x,y-1]);
Ey'[x,y] = ca * Ey[x,y] + cb * (Hz[x-1,y] - Hz[x,y]);
Hz'[x,y] = da * Hz[x,y] + db * (Ex[x,y+1] - Ex[x,y] + Ey'[x,y] - Ey'[x+1,y]);
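One way to realize the primed (updated) fields in plain C is two sweeps per time step: update E in place, then H from the freshly updated E. A sketch under these assumptions (array and coefficient names as above; boundary cells skipped):

/* sketch of one 2D FDTD time step; assumes C99 variable-length arrays */
void fdtd_step(int nx, int ny, double Ex[nx][ny], double Ey[nx][ny],
               double Hz[nx][ny], double ca, double cb,
               double da, double db) {
    /* sweep 1: electric field from the magnetic field */
    for (int x = 1; x < nx; x++)
        for (int y = 1; y < ny; y++) {
            Ex[x][y] = ca * Ex[x][y] + cb * (Hz[x][y] - Hz[x][y - 1]);
            Ey[x][y] = ca * Ey[x][y] + cb * (Hz[x - 1][y] - Hz[x][y]);
        }
    /* sweep 2: magnetic field from the updated electric field */
    for (int x = 0; x < nx - 1; x++)
        for (int y = 0; y < ny - 1; y++)
            Hz[x][y] = da * Hz[x][y]
                     + db * (Ex[x][y + 1] - Ex[x][y]
                           + Ey[x][y] - Ey[x + 1][y]);
}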
Translating FDTD to Stream Computing
2D arrays are streamed as a 1D stream; [x,y] accesses translate to constant stream offsets

figure: 5-point stencil (center with north, west, east, and south neighbors) on a grid with row length size; the index mapping is
  – center Ex[x,y]   → ex[i]
  – north  Ex[x-1,y] → ex[i-size]
  – west   Ex[x,y-1] → ex[i-1]
  – east   Ex[x,y+1] → ex[i+1]
  – south  Ex[x+1,y] → ex[i+size]

FDTD pseudo code:

foreach (x,y) {
    Ex'[x,y] = ca * Ex[x,y] + cb * (Hz[x,y] - Hz[x,y-1]);
    Ey'[x,y] = ca * Ey[x,y] + cb * (Hz[x-1,y] - Hz[x,y]);
    Hz'[x,y] = da * Hz[x,y] + db * (Ex[x,y+1] - Ex[x,y] + Ey'[x,y] - Ey'[x+1,y]);
}

excerpt from the MaxJ code for creating a Maxeler data flow engine:

HWVar ex = io.input("iEx", hwFloat(11, 53));
HWVar ey = io.input("iEy", hwFloat(11, 53));

HWVar hz_west  = stream.offset(hz, -1);
HWVar hz_north = stream.offset(hz, -size);

exnext = (ca * ex) + (cb * (hz - hz_west));
eynext = (ca * ey) + (cb * (hz_north - hz));

io.output("oEx", exresult, hwFloat(11, 53));
io.output("oEy", eyresult, hwFloat(11, 53));
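The same stream view spelled out in C (my sketch; assumes the linear index i = x*size + y over n = size*size grid points): the 2D sweep becomes a single pass in which every stencil neighbor is a constant offset, which is exactly what stream.offset expresses above:

/* sketch: E-field update pass over the linearized 1D stream */
void update_e_stream(int n, int size, const double *hz,
                     const double *ex, const double *ey,
                     double *ex_next, double *ey_next,
                     double ca, double cb) {
    for (int i = size; i < n - size; i++) {
        double hz_west  = hz[i - 1];    /* Hz[x][y-1] */
        double hz_north = hz[i - size]; /* Hz[x-1][y] */
        ex_next[i] = ca * ex[i] + cb * (hz[i] - hz_west);
        ey_next[i] = ca * ey[i] + cb * (hz_north - hz[i]);
    }
}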
Dataflow Engine for 2D FDTD simulation
figure: data flow graph of the 2D FDTD engine; for each stream element i it
  – updates Ex[i], Ey[i], and Hz[i]
  – checks whether the currently updated point i is the source or inside the boundary
  – chooses between the computed update and the boundary value
  – computes the result (energy density)
Performance Evaluation 2D FDTD
• implementation
  – uses a single DFE card (FPGA utilization ~80%)
  – data flow engine replicated in 15 pipeline stages
  – double-precision floating-point accuracy
• results
  – CPU performance breaks down at 2^18 grid points (working set size > cache size)
  – for large problems the FPGA achieves a 2.4x speedup over the parallel CPU implementation
figure: throughput (Mcells/s, 0 to 1800) over the number of grid points (2^14 to 2^24) for "Maxeler 2D, 1 DFE, 15 Pipeline Stages", "OMP 8 Threads", and "OMP 8 Threads, optimized Blocking"
Performance Evaluation 3D FDTD
• implementation
  – uses the MaxGenFD finite-difference toolkit for a simplified specification
  – single-precision floating-point accuracy
  – uses 4 DFE cards in parallel
• results
  – 91x faster than the single-core version
  – 7.5x faster than the optimized multi-core version
  – outperforms the fastest GPU solvers by 2x per device (at about 1/3 of the power consumption)
  – currently the fastest published FPGA-accelerated FDTD solver
figure: throughput (Mcells/s, 0 to 1400) over the number of grid points (2^12 to 2^30) for "MaxGenFD 3D, 2 DFEs, 4 Pipelines", "MaxGenFD 3D, 4 DFEs, 4 Pipelines", "SingleCore", "OMP-8Threads", and "OMP-24Threads"
• more details in: H. Giefers, C. Plessl, and J. Förstner. Accelerating finite difference time domain simulations with reconfigurable dataflow computers. In Proc. Int. Workshop on Highly Efficient Accelerators and Reconfigurable Technologies (HEART), June 2013. Accepted for publication.
Conclusions and Outlook
• technological and economic reasons have caused single-core CPU performance and efficiency to stagnate
• parallelism and customization are two trends to address the desire for more performance
• opportunities and challenges for the computational sciences
  – unprecedented levels of performance become affordable
  – but programming has become more difficult
  – the times where computer architecture could be treated as a black box are gone (at least for the moment)
• many research opportunities in making these new architectures easier to use for non-computer-scientists
Questions and Contact
• Questions?
• Contact information
Jun.-Prof. Dr. Christian Plessl
[email protected]
University of Paderborn
Department of Computer Science