
12 Parallel Computing

12.1 Motivation

Why do we need ever more computing power?

A much-cited legend: in 1947 the computer engineer Howard Aiken said that in the future the USA would need at most six computers!

Gordon Moore's law (Moore was a co-founder of Intel):

• 1965: the number of switches doubles every second year

• 1975 refinement: the number of switches on a CPU doubles every 18 months

Extrapolating this trend, by 2020 or 2030 we would reach the atomic level (or the quantum computer)


Flops (floating-point operations per second):

• first computers: 10^2 Flops

• desktop computers today: 10^9 Flops (Gigaflops, GFlops)

• supercomputers nowadays: 10^12 Flops (Teraflops, TFlops)

• the aim of today: 10^15 Flops (Petaflops, PFlops)

• next step: 10^18 Flops (Exaflops, EFlops)

• How can parallel processing help to achieve these goals?

• Why should we talk about it in the Scientific Computing course?

Example 12.1. Why are speeds of order 10^12 Flops not enough?

Weather forecast for Europe for the next 48 hours, from sea level up to 20 km: we need to solve a PDE in 3D (x, y, z and t).

• The volume of the region: 5000 × 4000 × 20 km^3.

• Stepsize in the xy-plane: 1 km (Cartesian mesh); stepsize in the z-direction: 0.1 km; timestep: 1 min.

• Around 100 flop per mesh point per timestep.

As a result, approximately


5000 × 4000 × 20 km^3 × 10 cells per km^3 = 4×10^9 mesh points

and

4×10^9 mesh points × 48 × 60 timesteps × 10^2 flop ≈ 1.15×10^15 flop.

If a computer can do 10^9 flops, this takes around

1.15×10^15 flop / 10^9 flops = 1.15×10^6 seconds ≈ 13 days!

But with 10^12 flops: 1.15×10^3 seconds < 20 min.

It is not hard to imagine "small" changes to this scheme for which 10^12 flops are no longer enough:


• Reduce the mesh stepsize to 0.1 km in each direction and the timestep to 10 seconds: that is 100 times more mesh points and 6 times more timesteps, a factor of 600 in work, so the total time grows from 20 min to about 8 days.

• We could replace the Europe model with a whole-world model (the area grows from 2×10^7 km^2 to 5×10^8 km^2) and combine it with an ocean model.

Therefore we must say: the needs of science and technology grow faster than the available possibilities; one only needs to change ε and h to become unsatisfied again!

But again, why do we need a parallel computer to achieve this goal?


Example 12.2. Have a look at the following piece of code:

do i = 1, 1000000000000
   z(i) = x(i) + y(i)    ! i.e. 3×10^12 memory accesses in total
end do

Assuming that data travels at the speed of light, 3×10^8 m/s, then for the loop to finish in 1 second the average distance to a memory cell must be

r = (3×10^8 m/s × 1 s) / (3×10^12) = 10^-4 m.

Typical computer memory is laid out as a rectangular (planar) mesh. Packing 3×10^12 cells into a square with side 2r = 2×10^-4 m means the edge length of each cell would be

2×10^-4 m / √(3×10^12) ≈ 10^-10 m,

– the size of a small atom!


Why is parallel computing not yet the prevailing way of computing? Three main reasons: hardware, algorithms and software.

Hardware: speed of

• networking

• peripheral devices

• memory access

does not keep pace with the growth in processor capacity.

Algorithms: an enormous number of different algorithms exists, but

• problems start with their implementation in real-life applications

Software: development in progress;

• no sufficiently good automatic parallelisers

• everything is done by hand

• no easily portable libraries across different architectures

• does the parallelisation effort pay off at all?


12.2 Classification of Parallel Computers

• Architecture

– Single-processor computer

– Multicore processor

– Distributed system

– Shared-memory system

• Network topology

– ring, array, hypercube, . . .

• Memory access

– shared, distributed, hybrid

• Operating system

– UNIX,

– LINUX,

– (OpenMosix)

– WIN*

• Scope

– supercomputing

– distributed computing

– real-time systems

– mobile systems

– grid and cloud computing

– etc.


But it is impossible to ignore (implicit or explicit) parallelism in a computer or a set of computers.

List of top parallel computer systems (http://www.top500.org).

12.3 Parallel Computational Models

• Data parallelism: e.g. vector processor

• Control Parallelism: e.g. shared memory

– Each process has access to all data. Access to the data is coordinated by some form of locking procedure.

• Message Passing: e.g. MPI

– Independent processes with independent memories. Communication between the processes happens via sending messages (a minimal MPI sketch follows at the end of this list).


• Other models:

– Hybrid versions of the above models,

– remote memory operations (“one-sided” sends and receives)
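
As a minimal sketch of the message-passing model (an assumed example, not from the original notes; the MPI routine names are standard), process 0 sends one number to process 1. Each process has its own memory; only the message is exchanged:

program ping
  use mpi                                            ! MPI library module
  implicit none
  integer :: rank, nprocs, ierr
  integer :: status(MPI_STATUS_SIZE)
  real(8) :: x

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)     ! my process number
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)   ! total number of processes

  if (rank == 0) then
     x = 3.14d0
     ! process 0 sends one double-precision value to process 1 with tag 17
     call MPI_Send(x, 1, MPI_DOUBLE_PRECISION, 1, 17, MPI_COMM_WORLD, ierr)
  else if (rank == 1) then
     ! process 1 receives the value into its own copy of x
     call MPI_Recv(x, 1, MPI_DOUBLE_PRECISION, 0, 17, MPI_COMM_WORLD, status, ierr)
     print *, 'process 1 of', nprocs, 'received', x
  end if

  call MPI_Finalize(ierr)
end program ping

Compiled with an MPI Fortran compiler (e.g. mpif90) and started, for example, with mpirun -np 2, the two processes run as independent programs and synchronise only through the send/receive pair.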

13 Parallel programming models

There are many different models for expressing parallelism in programming languages:

• Actor model

– Erlang

– Scala

• Coordination languages

– Linda

• CSP-based (Communicating Sequential Processes)

– FortranM

– Occam


• Dataflow

– SISAL (Streams and Iteration in a Single Assignment Language)

• Distributed

– Sequoia

– Bloom

• Event-driven and hardware description

– Verilog hardware description language (HDL)

• Functional

– Concurrent Haskell

• GPU languages

– CUDA

– OpenCL

– OpenACC

– OpenHMPP (HMPP for Hybrid Multicore Parallel Programming)

• Logic programming

– Parlog

• Multi-threaded

– Clojure


• Object-oriented

– Charm++

– Smalltalk

• Message-passing

– MPI

– PVM

• Partitioned global address space (PGAS)

– High Performance Fortran (HPF)


13.1 HPF

A partitioned-global-address-space parallel programming model; an extension of Fortran 90.

• SPMD (Single Program Multiple Data) model

• each process operates on its own part of the data

• HPF commands specify which processor gets which part of the data

• concurrency is defined by HPF commands on top of Fortran 90

• HPF directives are written as comments: !HPF$ <directive>


HPF example: matrix multiplication

PROGRAM ABmult
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 100
  INTEGER, DIMENSION(N,N) :: A, B, C
  INTEGER :: i, j
!HPF$ PROCESSORS square(2,2)
!HPF$ DISTRIBUTE (BLOCK,BLOCK) ONTO square :: C
!HPF$ ALIGN A(i,*) WITH C(i,*)
!     replicate copies of row A(i,*) onto the processors which compute C(i,j)
!HPF$ ALIGN B(*,j) WITH C(*,j)
!     replicate copies of column B(*,j) onto the processors which compute C(i,j)
  A = 1
  B = 2
  C = 0
  DO i = 1, N
    DO j = 1, N
      ! All the work is local due to the ALIGNs
      C(i,j) = DOT_PRODUCT(A(i,:), B(:,j))
    END DO
  END DO
  WRITE(*,*) C
END PROGRAM ABmult

HPF programming methodology

• Need to find a balance between concurrency and communication

• the more processes, the more communication

• aiming to

– find a balanced load based on the owner-computes rule

– achieve data locality

It is easy to write a program in HPF but difficult to gain good efficiency. Programming in HPF goes more or less like this:


1. Write a correctly working serial program, test and debug it

2. Add distribution directives, introducing as little communication as possible (see the sketch below)
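
A minimal sketch of this two-step technique (an assumed example, not from the original notes; the program name and data are made up). The serial loop is written and tested first; the HPF directives are then added as comments, so a serial compiler still accepts the same program unchanged:

PROGRAM update
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 1000000
  REAL, DIMENSION(n) :: x, y
  INTEGER :: i
! Step 2: distribution directives, added only after the serial version works;
!         a serial compiler simply ignores these comments
!HPF$ DISTRIBUTE x(BLOCK)
!HPF$ ALIGN y(i) WITH x(i)
  x = 1.0
  y = 2.0
! Step 1: the original, tested serial loop; with the directives above, each
!         processor updates only the block of x it owns (owner computes),
!         so no communication is needed
!HPF$ INDEPENDENT
  DO i = 1, n
     x(i) = x(i) + 2.0*y(i)
  END DO
  WRITE(*,*) x(1), x(n)
END PROGRAM update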


13.2 OpenMP

OpenMP tutorial (http://www.llnl.gov/computing/tutorials/openMP/)

• Programming model based on thread parallelism

• A helping tool for the programmer

• Based on compiler directives; for C/C++ and Fortran

• Nested parallelism is supported (though not by all implementations)

• Dynamic threads

• OpenMP has become a standard


• Fork-Join model
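
In the fork-join model the program starts with a single master thread, forks a team of threads at a parallel region, and joins back to a single thread at the end of the region. A minimal sketch (an assumed example, not from the original notes):

program forkjoin
  use omp_lib                        ! OpenMP runtime library routines
  implicit none
  integer :: tid

  print *, 'serial part: only the master thread runs here'

!$omp parallel private(tid)
  ! fork: every thread of the team executes this region
  tid = omp_get_thread_num()
  print *, 'hello from thread', tid, 'of', omp_get_num_threads()
!$omp end parallel
  ! join: the team is dissolved, only the master thread continues

  print *, 'serial part again after the join'
end program forkjoin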


OpenMP example: matrix multiplication

! http://www.llnl.gov/computing/tutorials/openMP/exercise.html
C*****************************************************************************
C OpenMP Example - Matrix Multiply - Fortran Version
C FILE: omp_mm.f
C DESCRIPTION:
C   Demonstrates a matrix multiply using OpenMP. Threads share row iterations
C   according to a predefined chunk size.
C LAST REVISED: 1/5/04  Blaise Barney
C*****************************************************************************

      PROGRAM MATMULT

      INTEGER  NRA, NCA, NCB, TID, NTHREADS, I, J, K, CHUNK,
     +         OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
C     number of rows in matrix A
      PARAMETER (NRA=62)
C     number of columns in matrix A
      PARAMETER (NCA=15)
C     number of columns in matrix B
      PARAMETER (NCB=7)

      REAL*8 A(NRA,NCA), B(NCA,NCB), C(NRA,NCB)

C     Set loop iteration chunk size
      CHUNK = 10

C     Spawn a parallel region, explicitly scoping all variables
!$OMP PARALLEL SHARED(A,B,C,NTHREADS,CHUNK) PRIVATE(TID,I,J,K)
      TID = OMP_GET_THREAD_NUM()
      IF (TID .EQ. 0) THEN
        NTHREADS = OMP_GET_NUM_THREADS()
        PRINT *, 'Starting matrix multiply example with', NTHREADS,
     +           'threads'
        PRINT *, 'Initializing matrices'
      END IF

C     Initialize matrices
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 30 I=1, NRA
        DO 30 J=1, NCA
          A(I,J) = (I-1)+(J-1)
   30 CONTINUE
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 40 I=1, NCA
        DO 40 J=1, NCB
          B(I,J) = (I-1)*(J-1)
   40 CONTINUE
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 50 I=1, NRA
        DO 50 J=1, NCB
          C(I,J) = 0
   50 CONTINUE

C     Do the matrix multiply, sharing iterations of the outer loop;
C     display who does which iterations for demonstration purposes
      PRINT *, 'Thread', TID, 'starting matrix multiply...'
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 60 I=1, NRA
        PRINT *, 'Thread', TID, 'did row', I
        DO 60 J=1, NCB
          DO 60 K=1, NCA
            C(I,J) = C(I,J) + A(I,K) * B(K,J)
   60 CONTINUE

C     End of parallel region
!$OMP END PARALLEL

C     Print results
      PRINT *, '**************************************************'
      PRINT *, 'Result Matrix:'
      DO 90 I=1, NRA
        DO 80 J=1, NCB
          WRITE(*,70) C(I,J)
   70     FORMAT(2x,f8.2,$)
   80   CONTINUE
        PRINT *, ' '
   90 CONTINUE
      PRINT *, '**************************************************'
      PRINT *, 'Done.'

      END