
Multi-Core-Architectures for Numerical Simulation

Lehrstuhl für Informatik 10 (Systemsimulation)

Universität Erlangen-Nürnberg

www10.informatik.uni-erlangen.de

Siemens Simulation Center

November 11, 2009

H. Köstler, J. Habich, J. Götz, M. Stürmer, S. Donath, T. Gradl, D. Ritter, C. Feichtinger, K. Iglberger (LSS Erlangen and RRZE)

U. Rüde (LSS Erlangen)

In collaboration with RRZE and many more


Overview

Intro

Who we are

How fast are computers today?

Technological Trends

GPUs, Cell, and others

Example: Flow Simulation with Lattice Boltzmann Methods

Computational Haemodynamics using the PlayStation

Conclusions


The LSS Mission

Development and Analysis of Computer Methods for Applications in Science and Engineering

Applications from Physical and Engineering Sciences

Computer Science

Mathematics

LSS


Who is at LSS (and does what?)

• C. Feichtinger

• S. Donath

• C. Mihoubi

• J. Götz

• S. Ganguly

• K. Pickl

• S. Bogner


Alumni

Prof. G. Horton (Univ. of Magdeburg)

Prof. El Mostafa Kalmoun (Cadi Ayyad University, Morocco)

Dr. M. Kowarschik (Siemens Health Care)

Dr. M. Mohr (Geophysics, TU München)

Dr. F. Hülsemann (EDF, Paris)

Dr. B. Bergen (Los Alamos, USA)

Dr. N. Thürey (ETH Zürich)

Dr. J. Härdtlein (Bosch GmbH)

C. Möller (Navigon)

Dr. U. Fabricius (Elektrobit)

Dr. Th. Pohl (Siemens Health Care)

J. Treibig (RRZE)

C. Freundl (YAGER Development)

Laser Simulation

Prof. Dr. C. Pflaum

Supercomputing

J. Götz

Numerical Algorithms

H. Köstler

Complex Flows

K. Iglberger

B. Berneker

M. Wohlmuth

C. Jandl

Kai Hertel

J. Werner

T. Gradl

M. Stürmer

F. Deserno

D. Ritter

B. Gmeiner

S. Geißelsöder

• T. Dreher

• Dr. W. Degen

• T. Preclik

• D. Bartuschat

• S. Strobl

• Li Yi


How much is a PetaFlops?

10^6 = 1 MegaFlops: Intel 486, 33 MHz PC (~1989)

10^9 = 1 GigaFlops: Intel Pentium III, 1 GHz (~2000)

If every person on earth does one operation every 6 seconds, all humans together have 1 GigaFlops performance (less than a current laptop)

10^12 = 1 TeraFlops: HLRB-I, 1344 processors (~2000)

10^15 = 1 PetaFlops: >100,000 processor cores, Roadrunner/Los Alamos: June 2008

• If every person on earth runs a 486 PC, we all together have an aggregate performance of 6 PetaFlops.

HLRB-II: 63 TFlops

HLRB-I: 2 TFlops
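The two population comparisons on this slide can be checked with a few lines of arithmetic. The sketch below assumes a world population of about 6 billion, which is the value the slide's own figures imply; the constant and the function names are illustrative and not part of the original talk.

```c
/* Assumed world population (about 6 billion, implied by the slide's numbers). */
#define WORLD_POPULATION 6.0e9

/* Aggregate rate in Flops if every person sustains `flops_per_person`. */
double aggregate_flops(double flops_per_person) {
    return WORLD_POPULATION * flops_per_person;
}

/* Number of machines of rate `machine_flops` needed to reach `target_flops`. */
double machines_needed(double target_flops, double machine_flops) {
    return target_flops / machine_flops;
}
```

Under this assumption, `aggregate_flops(1.0 / 6.0)` (one operation every 6 seconds) gives 1e9, i.e. 1 GigaFlops, and `aggregate_flops(1.0e6)` (one 486 PC per person) gives 6e15, i.e. 6 PetaFlops. `machines_needed(1.0e15, 1.0e6)` shows that one PetaFlops corresponds to a billion 486-class machines.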


Where Does Computer Architecture Go?

Computer architects have capitulated: It may not be possible anymore to exploit progress in semiconductor technology for automatic performance improvements

Even today a single core CPU is a highly parallel system:

superscalar execution, complex pipeline, ... and additional tricks

Internal parallelism is a major reason for the performance increases until now, but ...

There is a limited amount of parallelism that can be exploited automatically

Multi-core systems concede the architects' defeat:

Architects fail to build faster single core CPUs given more transistors

Clock rate increases only slowly (due to power considerations)

Therefore architects have started to put several cores on a chip:

programmers must use them directly


What are the consequences?

For the application developers “the free lunch is over”

Without explicitly parallel algorithms, the performance potential cannot be used any more

it will become increasingly important to use instruction level parallelism (such as vector units)

For performance-critical applications:

CPUs will have 2, 4, 8, 16, ..., 128, ..., ??? cores - maybe sooner than we are ready for this

At the high end, we will have to deal with systems with millions of cores


Trends in Computer Architecture

On Chip Parallelism for everyone

instruction level

SIMD-like vectorization

multicore (with caches or local memory)

Off Chip parallelism

for large scale parallel systems

Accelerator hardware

GPUs

Cell processor

Limits to clock rate

Limits to memory bandwidth

Limits to memory latency

Multi-Core Activities at LSS

Architectures

IBM Cell

GPU

• Nvidia

• AMD/ATI

Conventional multi-core architectures (Intel, AMD)

Applications

Finite elements - PDE - multigrid methods (flow solvers, LBM methods)

Image processing, medical engineering applications

3D real-time simulation for industrial control

See papers, reports, Master's and Bachelor's theses:

http://www10.informatik.uni-erlangen.de/Publications/


Multi Core Architectures

IBM-Sony-Toshiba Cell Processor

GPU: Nvidia or AMD/ATI


The STI Cell Processor

hybrid multicore processor based on IBM Power architecture

(simplified) PowerPC core

runs operating system

controls execution of programs

multiple co-processors (8, on Sony PS3 only 6 available)

operate on fast, private on-chip memory

optimized for computation

vectorization: "float4" data type

DMA controller copies data from/to main memory

• multi-buffering can hide main memory latencies completely for streaming-like applications

• loading local copies has low and known latencies

memory with multiple channels and banks can be exploited if many memory transactions are in-flight


IBM Cell Processor

Available Cell systems:

Roadrunner

Blades

PlayStation 3

Cell Architecture: 9 cores on a chip


GPUs

massively parallel SIMD-like execution on several hundred compute units

typical performance values (Nvidia Fermi, expected soon):

2.7 TFlops single precision possible

630 GFlops double precision

4+x GByte memory "on board"

150+x GByte/s memory bandwidth

additionally vectorization in "warps" (16 floats)

ATI Radeon HD 4870

Cost: 150 €

Interface: PCI-E 2.0 x16

Shader Clock: 750 MHz

Memory Clock: 900 MHz

Memory Bandwidth: 115 GB/s

FLOPS: 1200 GFLOPS

Max Power Draw: 160 W

Framebuffer: 1024 MB

Memory Bus: 256 bit

Shader Processors: 800


Nvidia GeForce GTX 295

Cost: 450 €

Interface: PCI-E 2.0 x16

Shader Clock: 1242 MHz

Memory Clock: 999 MHz

Memory Bandwidth: 2x112 GB/s

FLOPS: 2x894 GFLOPS

Max Power Draw: 289 W

Framebuffer: 2x896 MB

Memory Bus: 2x448 bit

Shader Processors: 2x240
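The FLOPS and bandwidth entries on the two data sheets above follow from the other entries by simple products. The sketch below is one way to reproduce them. The per-cycle factors are assumptions that are not stated on the slides: 2 floating-point operations per shader processor per cycle (multiply-add) for the Radeon HD 4870, 3 (multiply-add plus multiply) for each GPU of the GeForce GTX 295, and 4 or 2 data transfers per memory clock, respectively. The function names are illustrative.

```c
/* Peak arithmetic rate in GFLOPS:
   shader processors * shader clock (MHz) * flops per processor per cycle. */
double peak_gflops(int shader_processors, double shader_clock_mhz,
                   int flops_per_cycle) {
    return shader_processors * shader_clock_mhz * flops_per_cycle / 1000.0;
}

/* Peak memory bandwidth in GB/s:
   bus width (bytes) * memory clock (MHz) * transfers per clock. */
double peak_bandwidth_gbs(int bus_width_bits, double memory_clock_mhz,
                          int transfers_per_clock) {
    return bus_width_bits / 8.0 * memory_clock_mhz * transfers_per_clock / 1000.0;
}
```

With these assumed factors, `peak_gflops(800, 750.0, 2)` gives 1200 and `peak_gflops(240, 1242.0, 3)` gives about 894, matching the listed FLOPS values. `peak_bandwidth_gbs(256, 900.0, 4)` gives about 115 and `peak_bandwidth_gbs(448, 999.0, 2)` about 112, matching the listed bandwidths. These are theoretical peaks; sustained application performance is usually much lower.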


GPU: AMD Stream Processor

AMD Stream Architecture (cont'd)

ATI Radeon 3870 (RV670) / FireStream 9170

Example 1: Flow Simulation on Cell


LBM Optimized for Cell

memory layout

optimized for DMA transfers

information propagating between patches is reordered on the SPE and stored sequentially in memory for simple and fast exchange

code optimization

kernels hand-optimized in assembly language

SIMD-vectorized streaming and collision

branch-free handling of bounce-back boundary conditions
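The idea behind branch-free bounce-back handling can be illustrated in portable C. This is only a sketch of the general technique, not the hand-written assembly kernel from the talk: a bit mask derived from the obstacle flag selects between the streamed value and the reflected value, so no conditional jump is needed. All names are hypothetical. On the SPE, a select intrinsic such as `spu_sel` would play the role of `select_float`.

```c
#include <stdint.h>
#include <string.h>

/* Bitwise select: returns a where mask bits are 1, b where they are 0. */
static float select_float(float a, float b, uint32_t mask) {
    uint32_t ua, ub, ur;
    float r;
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);
    ur = (ua & mask) | (ub & ~mask);
    memcpy(&r, &ur, sizeof r);
    return r;
}

/* Value of one distribution function after streaming.
   streamed:  value arriving from the neighbour cell
   reflected: this cell's own value of the opposite direction (bounce-back)
   neighbour_is_obstacle: nonzero if the neighbour is a solid cell */
float pull_with_bounce_back(float streamed, float reflected,
                            int neighbour_is_obstacle) {
    /* 0x00000000 for fluid neighbours, 0xFFFFFFFF for obstacles */
    uint32_t mask = (uint32_t)0 - (uint32_t)(neighbour_is_obstacle != 0);
    return select_float(reflected, streamed, mask);
}
```

Because both candidate values are always computed and only the selection depends on the flag, the same instruction sequence runs for every cell, which is what makes the approach suitable for SIMD execution.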

Simulation of Metal Foams

Free Surface Flows

Applications:

Engineering: metal foam simulations

Computer graphics: special effects

Based on LBM:

Mesoscopic approach to solving the Navier-Stokes (NS) equations

Good for complex boundary conditions

Details: D3Q19 model, BGK collision and grid compression
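For readers unfamiliar with the terms, a D3Q19 model stores 19 distribution functions per cell (one rest direction, 6 axis directions, 12 diagonal directions), and the BGK collision relaxes them towards a local equilibrium. The following is a minimal textbook-style sketch of that collision step for a single cell, in lattice units. It is not the optimized kernel from the talk, it omits streaming and grid compression, and the names and the ordering of directions are assumptions.

```c
/* D3Q19 lattice velocities: rest, 6 axis directions, 12 diagonals. */
const int D3Q19_C[19][3] = {
    { 0, 0, 0},
    { 1, 0, 0}, {-1, 0, 0}, { 0, 1, 0}, { 0,-1, 0}, { 0, 0, 1}, { 0, 0,-1},
    { 1, 1, 0}, {-1,-1, 0}, { 1,-1, 0}, {-1, 1, 0},
    { 1, 0, 1}, {-1, 0,-1}, { 1, 0,-1}, {-1, 0, 1},
    { 0, 1, 1}, { 0,-1,-1}, { 0, 1,-1}, { 0,-1, 1}
};

/* Lattice weights: 1/3 (rest), 1/18 (axis), 1/36 (diagonal). */
double weight(int q) {
    if (q == 0) return 1.0 / 3.0;
    if (q <= 6) return 1.0 / 18.0;
    return 1.0 / 36.0;
}

/* Density and velocity of one cell from its 19 distribution functions. */
void moments(const double f[19], double *rho, double u[3]) {
    double r = 0.0, m[3] = {0.0, 0.0, 0.0};
    for (int q = 0; q < 19; ++q) {
        r += f[q];
        for (int d = 0; d < 3; ++d) m[d] += f[q] * D3Q19_C[q][d];
    }
    *rho = r;
    for (int d = 0; d < 3; ++d) u[d] = m[d] / r;
}

/* Second-order equilibrium distribution for direction q. */
double equilibrium(int q, double rho, const double u[3]) {
    double cu = D3Q19_C[q][0] * u[0] + D3Q19_C[q][1] * u[1] + D3Q19_C[q][2] * u[2];
    double uu = u[0] * u[0] + u[1] * u[1] + u[2] * u[2];
    return weight(q) * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * uu);
}

/* BGK collision: relax every f[q] towards equilibrium with rate omega. */
void bgk_collide(double f[19], double omega) {
    double rho, u[3];
    moments(f, &rho, u);
    for (int q = 0; q < 19; ++q)
        f[q] -= omega * (f[q] - equilibrium(q, rho, u));
}
```

A useful sanity check for any implementation of this step is that density and momentum of the cell are unchanged by `bgk_collide`, up to rounding.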

Performance Results

[Figure: bar chart of LBM performance on a single core (8x8x8 channel flow) for Xeon 5160, PPE, and SPE*, comparing straightforward C code with SIMD-optimized assembly; vertical axis from 0 to 50; data labels: 49.0, 2.0, 4.8, 10.4]

*on Local Store without DMA transfers

Performance Results

[Figure: chart with horizontal axis from 1 to 6 and vertical axis from 30 to 100; data labels: 95, 94, 94, 93, 81, 42]

Performance Results

[Figure: bar chart comparing Xeon 5160* and PlayStation 3 for 1 core and 1 CPU; vertical axis from 0 to 50; data labels: 43.8, 11.7, 21.1, 9.1]

*performance optimized code by LB-DC


Programming the Cell-BE

the hard way

control SPEs using management libraries

issue DMAs by language extensions

do address calculations manually

exchange main memory addresses, array sizes etc.

synchronization using mailboxes, signals or libraries

frameworks

Accelerated Library Framework (ALF) and Data, Communication, and Synchronization (DaCS) by IBM

RapidMind SDK

accelerated libraries

single-source-compiler

IBM’s xlc-cbe-sse, uses OpenMP


Naive SPU implementation: A[] = A[]*c

    #include <spu_intrinsics.h>   // spu_splats, spu_mul
    #include <spu_mfcio.h>        // mfc_get, mfc_put, tag status functions

    volatile vector float ls_buffer[8] __attribute__((aligned(128)));

    void scale( unsigned long long gs_buffer,   // main memory address of vector
                int number_of_chunks,           // number of chunks of 32 floats
                float factor ) {                // scaling factor
      vector float v_fac = spu_splats(factor);  // create SIMD vector with all
                                                // four elements being factor
      for ( int i = 0 ; i < number_of_chunks ; ++i ) {
        mfc_get( ls_buffer , gs_buffer , 128 , 0 ,0,0);   // DMA reading i-th chunk
        mfc_write_tag_mask( 1 << 0 );                     // wait for DMA...
        mfc_read_tag_status_all();                        // ...to complete
        for ( int j = 0 ; j < 8 ; ++j )
          ls_buffer[j] = spu_mul( ls_buffer[j] , v_fac ); // scale local copy using SIMD
        mfc_put( ls_buffer , gs_buffer , 128 , 0 ,0,0);   // DMA writing i-th chunk
        mfc_write_tag_mask( 1 << 0 );                     // wait for DMA...
        mfc_read_tag_status_all();                        // ...to complete
        gs_buffer += 128;                                 // incr. global store pointer
      }
    }


Remove latencies using multi-buffering

    volatile vector float ls_buffer[3][8] __attribute__((aligned(128)));

    ...

    mfc_get( ls_buffer[0] , gs_buffer , 128 , 0 ,0,0);  // request first chunk
    for (int i = 0; i < number_of_chunks; ++i) {
      int cur  = ( i ) % 3;  // buffer no. and DMA tag for i-th chunk
      int next = (i+1) % 3;  // buffer no. and DMA tag for (i-2)-th and (i+1)-th chunk
      if (i < number_of_chunks-1) {
        mfc_write_tag_mask( 1 << next );  // make sure the (i-2)-th chunk...
        mfc_read_tag_status_all();        // ...has been stored
        mfc_get( ls_buffer[next] , gs_buffer+128 , 128 , next ,0,0);  // request (i+1)-th chunk
      }
      mfc_write_tag_mask( 1 << cur );     // wait until i-th chunk...
      mfc_read_tag_status_all();          // ...is available
      for (int j = 0; j < 8; ++j)
        ls_buffer[cur][j] = spu_mul(ls_buffer[cur][j], v_fac);  // scale local copy using SIMD
      mfc_put( ls_buffer[cur] , gs_buffer , 128 , cur ,0,0);    // store i-th chunk
      gs_buffer += 128;
    }
    mfc_write_tag_mask( 1 | 2 | 4 );      // wait for any...
    mfc_read_tag_status_all();            // ...outstanding DMA
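The buffer rotation in the multi-buffered loop can be tried out on any machine with a portable emulation. The sketch below is an assumption-laden stand-in, not SPU code: `memcpy` replaces the asynchronous `mfc_get` and `mfc_put` transfers and completes immediately, so it shows only the order in which the three buffers are used, not the latency hiding itself. The function and macro names are hypothetical.

```c
#include <string.h>

#define CHUNK_FLOATS 32  /* 128 bytes per chunk, as in the SPU code */
#define NUM_BUFFERS  3

/* Scale `number_of_chunks` chunks in place, cycling through three local
   buffers. memcpy is a synchronous stand-in for the DMA transfers. */
void scale_triple_buffered(float *main_memory, int number_of_chunks,
                           float factor) {
    float local[NUM_BUFFERS][CHUNK_FLOATS];
    if (number_of_chunks <= 0) return;

    memcpy(local[0], main_memory, sizeof local[0]);      /* "get" first chunk */
    for (int i = 0; i < number_of_chunks; ++i) {
        int cur  = i % NUM_BUFFERS;        /* buffer holding chunk i */
        int next = (i + 1) % NUM_BUFFERS;  /* buffer for chunk i+1 */
        if (i < number_of_chunks - 1)      /* "get" chunk i+1 */
            memcpy(local[next], main_memory + (i + 1) * CHUNK_FLOATS,
                   sizeof local[next]);
        for (int j = 0; j < CHUNK_FLOATS; ++j)
            local[cur][j] *= factor;       /* compute on chunk i */
        memcpy(main_memory + i * CHUNK_FLOATS, local[cur],
               sizeof local[cur]);         /* "put" chunk i */
    }
}
```

On the real hardware, three buffers are needed because at any time one chunk is being computed on, the next one is being loaded, and the previous one may still be on its way back to main memory.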


Example 2: LBM on Graphics Cards

Johannes Habich, M.Sc.

OpenMP vs. CUDA

OpenMP

• Divide domain into huge chunks

[Figure: cells 0 to 5 assigned in contiguous chunks to Thread 0, Thread 1, and Thread 2]

CUDA

• Divide domain into small pieces

• Alignment constraints must be met

• No caches or cache lines need to be considered

[Figure: cells 0 to 5 split between Block 1 and Block 2, each with Thread 0, Thread 1, and Thread 2]
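The two decompositions on this slide can be written as index mappings. The sketch below assumes a one-dimensional domain, a static OpenMP-like schedule with equally sized contiguous chunks, and CUDA blocks of three threads, which is one reading of the six-cell figure. Block and thread numbers are zero-based here, as in CUDA's `blockIdx` and `threadIdx`, whereas the slide labels the blocks 1 and 2. The function names are illustrative.

```c
/* OpenMP-style decomposition: each thread owns one contiguous chunk.
   Returns the thread that processes `cell`. */
int openmp_owner(int cell, int num_cells, int num_threads) {
    int chunk = (num_cells + num_threads - 1) / num_threads;  /* rounded up */
    return cell / chunk;
}

/* CUDA-style decomposition: every thread handles a single cell,
   with cell = block * threads_per_block + thread. */
void cuda_owner(int cell, int threads_per_block, int *block, int *thread) {
    *block  = cell / threads_per_block;
    *thread = cell % threads_per_block;
}
```

For 6 cells and 3 threads, `openmp_owner` assigns cells 0 and 1 to thread 0, cells 2 and 3 to thread 1, and cells 4 and 5 to thread 2. With 3 threads per block, `cuda_owner` places cells 0 to 2 in the first block and cells 3 to 5 in the second, so neighbouring threads touch neighbouring memory locations.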


Performance Results for GPUs

• Up to 6 times CPU performance if well implemented

• Less than CPU performance if straightforwardly implemented

• Use padding to circumvent performance breakdown (80% loss)

Arbitrary Geometries
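The slide does not say which form of padding was used. One common form, shown here purely as an assumption, rounds every row of the grid up to a multiple of the alignment unit so that each row starts at an aligned address. The alignment of 16 elements in the usage note is likewise an assumed value, and the function names are hypothetical.

```c
#include <stddef.h>

/* Round a row length up to the next multiple of `alignment` elements. */
size_t padded_pitch(size_t row_length, size_t alignment) {
    return (row_length + alignment - 1) / alignment * alignment;
}

/* Linear index of element (x, y) in a row-major array with padded rows. */
size_t padded_index(size_t x, size_t y, size_t pitch) {
    return y * pitch + x;
}
```

For example, a row of 100 floats padded to an alignment of 16 elements gets a pitch of 112, so element (3, 2) is stored at index 2 * 112 + 3 = 227. The padding elements are allocated but never used in the computation.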


Part IV

Conclusions

Computer Science X - System Simulation Group, Markus Stürmer

Evolution of processors: Improvements

                          Std.-CPU   CBEA       GPU
pipelining                yes        yes        yes
superscalar execution     yes        yes        no
out-of-order              yes        no         no
wider buses               yes        yes        yes
SIMD                      yes        yes        yes
multithreading            yes        yes / no   yes
multiprocessing           yes        yes        yes
caches                    yes        yes / no   yes
hardware prefetcher       yes        ? / no     no
local storage             no         no / yes   yes
resource virtualization   yes        yes / no   no

Row groups on the slide: instruction, data, thread, transfer


Conclusions

There is no way around Multi-Core architectures in the foreseeable future

Multi-Core Accelerators have excellent performance potential, but ...

they are not suitable for all algorithms

they are tricky to program

many results published in the literature are too optimistic

• people tend to make unfair comparisons

accelerators and their programming environments are changing very quickly

what about robustness, numerical accuracy, etc.?

there is a good chance that we will see these technologies in the future, but likely in a different guise


Thank you for your attention

Questions?

Slides, papers, reports, theses, and animations are available for download at: www10.informatik.uni-erlangen.de
