
12 Parallel Computing

12.1 Motivation

Why do we need ever more computing power?

A much-cited legend: in 1947 the computer engineer Howard Aiken said that in the future the USA would need at most six computers!

Gordon Moore's law (Moore was a co-founder of Intel):

• 1965: the number of switches doubles every second year

• 1975 refinement: the number of switches on a CPU doubles every 18 months

Extrapolating this trend, by 2020 or 2030 we would reach the atomic level (or the quantum computer)


Flops (floating-point operations per second):

• first computers: 10^2 Flops

• desktop computers today: 10^9 Flops (Gigaflops, GFlops)

• supercomputers nowadays: 10^12 Flops (Teraflops, TFlops)

• the aim of today: 10^15 Flops (Petaflops, PFlops)

• next step: 10^18 Flops (Exaflops, EFlops)

• How can parallel processing help to achieve these goals?

• Why should we talk about it in the Scientific Computing course?

Example 12.1. Why are speeds of order 10^12 Flops not enough?

Weather forecast for Europe for the next 48 hours, from sea level up to 20 km: we need to solve a PDE in 3D (x, y, z and t).

• The volume of the region: 5000 × 4000 × 20 km^3.

• Stepsize in the xy-plane: 1 km (Cartesian mesh); stepsize in the z-direction: 0.1 km; timestep: 1 min.

• Around 100 flop per mesh point per timestep.

As a result, approximately


5000 × 4000 × 20 km^3 × 10 cells per km^3 = 4×10^9 mesh points

and

4×10^9 mesh points × 48 × 60 timesteps × 10^2 flop ≈ 1.15×10^15 flop.

If a computer can do 10^9 flops, this takes around

1.15×10^15 flop / 10^9 flops = 1.15×10^6 seconds ≈ 13 days!

But with 10^12 flops: 1.15×10^3 seconds < 20 min.

It is not hard to imagine "small" changes to this scheme for which 10^12 flops are no longer enough:


• Reduce the mesh stepsize to 0.1 km in each direction and the timestep to 10 seconds: that is 100 times more mesh points and 6 times more timesteps, a factor of 600 in work, so the total time grows from 20 min to about 8 days.

• We could replace the Europe model with a whole-world model (the area grows from 2×10^7 km^2 to 5×10^8 km^2) and combine it with an ocean model.

Therefore we must say: the needs of science and technology grow faster than the available possibilities; one only needs to change ε and h to become unsatisfied again!

But again, why do we need a parallel computer to achieve this goal?


Example 12.2. Have a look at the following piece of code:

do i = 1, 1000000000000
   z(i) = x(i) + y(i)    ! i.e. 3×10^12 memory accesses in total
end do

Assuming that data travels at the speed of light, 3×10^8 m/s, then for the loop to finish in 1 second the average distance to a memory cell must be

r = (3×10^8 m/s × 1 s) / (3×10^12) = 10^-4 m.

Typical computer memory is laid out as a rectangular (planar) mesh. Packing 3×10^12 cells into a square with side 2r = 2×10^-4 m means the edge length of each cell would be

2×10^-4 m / √(3×10^12) ≈ 10^-10 m,

– the size of a small atom!


Why is parallel computing not yet the prevailing way of computing? Three main reasons: hardware, algorithms and software.

Hardware: speed of

• networking

• peripheral devices

• memory access

does not keep pace with the growth in processor capacity.

Algorithms: an enormous number of different algorithms exists, but

• problems start with their implementation in real-life applications

Software: development in progress;

• no sufficiently good automatic parallelisers

• everything is done by hand

• no easily portable libraries across different architectures

• does the parallelisation effort pay off at all?


12.2 Classification of Parallel Computers

• Architecture

– Single-processor computer

– Multicore processor

– Distributed system

– Shared-memory system

• Network topology

– ring, array, hypercube, . . .

• Memory access

– shared, distributed, hybrid

• Operating system

– UNIX,

– LINUX,

– (OpenMosix)

– WIN*

• Scope

– supercomputing

– distributed computing

– real-time systems

– mobile systems

– grid and cloud computing

– etc.


But it is impossible to ignore (implicit or explicit) parallelism in a computer or a set of computers.

List of top parallel computer systems (http://www.top500.org).

12.3 Parallel Computational Models

• Data parallelism: e.g. vector processor

• Control Parallelism: e.g. shared memory

– Each process has access to all data. Access to the data is coordinated by some form of locking procedure.

• Message Passing: e.g. MPI

– Independent processes with independent memories. Communication between the processes happens via sending messages (a minimal MPI sketch follows at the end of this list).


• Other models:

– Hybrid versions of the above models,

– remote memory operations (“one-sided” sends and receives)
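
As a minimal sketch of the message-passing model (an assumed example, not from the original notes; the MPI routine names are standard), process 0 sends one number to process 1. Each process has its own memory; only the message is exchanged:

program ping
  use mpi                                            ! MPI library module
  implicit none
  integer :: rank, nprocs, ierr
  integer :: status(MPI_STATUS_SIZE)
  real(8) :: x

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)     ! my process number
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)   ! total number of processes

  if (rank == 0) then
     x = 3.14d0
     ! process 0 sends one double-precision value to process 1 with tag 17
     call MPI_Send(x, 1, MPI_DOUBLE_PRECISION, 1, 17, MPI_COMM_WORLD, ierr)
  else if (rank == 1) then
     ! process 1 receives the value into its own copy of x
     call MPI_Recv(x, 1, MPI_DOUBLE_PRECISION, 0, 17, MPI_COMM_WORLD, status, ierr)
     print *, 'process 1 of', nprocs, 'received', x
  end if

  call MPI_Finalize(ierr)
end program ping

Compiled with an MPI Fortran compiler (e.g. mpif90) and started, for example, with mpirun -np 2, the two processes run as independent programs and synchronise only through the send/receive pair.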

13 Parallel programming models

There are many different models for expressing parallelism in programming languages:

• Actor model

– Erlang

– Scala

• Coordination languages

– Linda

• CSP-based (Communicating Sequential Processes)

– FortranM

– Occam


• Dataflow

– SISAL (Streams and Iteration in a Single Assignment Language)

• Distributed

– Sequoia

– Bloom

• Event-driven and hardware description

– Verilog hardware description language (HDL)

• Functional

– Concurrent Haskell

• GPU languages

– CUDA

– OpenCL

– OpenACC

– OpenHMPP (HMPP for Hybrid Multicore Parallel Programming)

• Logic programming

– Parlog

• Multi-threaded

– Clojure


• Object-oriented

– Charm++

– Smalltalk

• Message-passing

– MPI

– PVM

• Partitioned global address space (PGAS)

– High Performance Fortran (HPF)


13.1 HPF

A partitioned-global-address-space parallel programming model; an extension of Fortran 90.

• SPMD (Single Program Multiple Data) model

• each process operates on its own part of the data

• HPF commands specify which processor gets which part of the data

• concurrency is defined by HPF commands on top of Fortran 90

• HPF directives are written as comments: !HPF$ <directive>


HPF example: matrix multiplication

PROGRAM ABmult
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 100
  INTEGER, DIMENSION(N,N) :: A, B, C
  INTEGER :: i, j
!HPF$ PROCESSORS square(2,2)
!HPF$ DISTRIBUTE (BLOCK,BLOCK) ONTO square :: C
!HPF$ ALIGN A(i,*) WITH C(i,*)
!     replicate copies of row A(i,*) onto the processors which compute C(i,j)
!HPF$ ALIGN B(*,j) WITH C(*,j)
!     replicate copies of column B(*,j) onto the processors which compute C(i,j)
  A = 1
  B = 2
  C = 0
  DO i = 1, N
    DO j = 1, N
      ! All the work is local due to the ALIGNs
      C(i,j) = DOT_PRODUCT(A(i,:), B(:,j))
    END DO
  END DO
  WRITE(*,*) C
END PROGRAM ABmult

HPF programming methodology

• Need to find a balance between concurrency and communication

• the more processes, the more communication

• aiming to

– find a balanced load based on the owner-computes rule

– achieve data locality

It is easy to write a program in HPF but difficult to gain good efficiency. Programming in HPF goes more or less like this:


1. Write a correctly working serial program, test and debug it

2. Add distribution directives, introducing as little communication as possible (see the sketch below)
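
A minimal sketch of this two-step technique (an assumed example, not from the original notes; the program name and data are made up). The serial loop is written and tested first; the HPF directives are then added as comments, so a serial compiler still accepts the same program unchanged:

PROGRAM update
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 1000000
  REAL, DIMENSION(n) :: x, y
  INTEGER :: i
! Step 2: distribution directives, added only after the serial version works;
!         a serial compiler simply ignores these comments
!HPF$ DISTRIBUTE x(BLOCK)
!HPF$ ALIGN y(i) WITH x(i)
  x = 1.0
  y = 2.0
! Step 1: the original, tested serial loop; with the directives above, each
!         processor updates only the block of x it owns (owner computes),
!         so no communication is needed
!HPF$ INDEPENDENT
  DO i = 1, n
     x(i) = x(i) + 2.0*y(i)
  END DO
  WRITE(*,*) x(1), x(n)
END PROGRAM update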


13.2 OpenMP

OpenMP tutorial (http://www.llnl.gov/computing/tutorials/openMP/)

• Programming model based on thread parallelism

• A helping tool for the programmer

• Based on compiler directives; for C/C++ and Fortran

• Nested parallelism is supported (though not by all implementations)

• Dynamic threads

• OpenMP has become a standard


• Fork-Join model
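
In the fork-join model the program starts with a single master thread, forks a team of threads at a parallel region, and joins back to a single thread at the end of the region. A minimal sketch (an assumed example, not from the original notes):

program forkjoin
  use omp_lib                        ! OpenMP runtime library routines
  implicit none
  integer :: tid

  print *, 'serial part: only the master thread runs here'

!$omp parallel private(tid)
  ! fork: every thread of the team executes this region
  tid = omp_get_thread_num()
  print *, 'hello from thread', tid, 'of', omp_get_num_threads()
!$omp end parallel
  ! join: the team is dissolved, only the master thread continues

  print *, 'serial part again after the join'
end program forkjoin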


OpenMP example: matrix multiplication

! http://www.llnl.gov/computing/tutorials/openMP/exercise.html
C*****************************************************************************
C OpenMP Example - Matrix Multiply - Fortran Version
C FILE: omp_mm.f
C DESCRIPTION:
C   Demonstrates a matrix multiply using OpenMP. Threads share row iterations
C   according to a predefined chunk size.
C LAST REVISED: 1/5/04  Blaise Barney
C*****************************************************************************

      PROGRAM MATMULT

      INTEGER  NRA, NCA, NCB, TID, NTHREADS, I, J, K, CHUNK,
     +         OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
C     number of rows in matrix A
      PARAMETER (NRA=62)
C     number of columns in matrix A
      PARAMETER (NCA=15)
C     number of columns in matrix B
      PARAMETER (NCB=7)

      REAL*8 A(NRA,NCA), B(NCA,NCB), C(NRA,NCB)

C     Set loop iteration chunk size
      CHUNK = 10

C     Spawn a parallel region, explicitly scoping all variables
!$OMP PARALLEL SHARED(A,B,C,NTHREADS,CHUNK) PRIVATE(TID,I,J,K)
      TID = OMP_GET_THREAD_NUM()
      IF (TID .EQ. 0) THEN
        NTHREADS = OMP_GET_NUM_THREADS()
        PRINT *, 'Starting matrix multiply example with', NTHREADS,
     +           'threads'
        PRINT *, 'Initializing matrices'
      END IF

C     Initialize matrices
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 30 I=1, NRA
        DO 30 J=1, NCA
          A(I,J) = (I-1)+(J-1)
   30 CONTINUE
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 40 I=1, NCA
        DO 40 J=1, NCB
          B(I,J) = (I-1)*(J-1)
   40 CONTINUE
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 50 I=1, NRA
        DO 50 J=1, NCB
          C(I,J) = 0
   50 CONTINUE

C     Do the matrix multiply, sharing iterations of the outer loop;
C     display who does which iterations for demonstration purposes
      PRINT *, 'Thread', TID, 'starting matrix multiply...'
!$OMP DO SCHEDULE(STATIC, CHUNK)
      DO 60 I=1, NRA
        PRINT *, 'Thread', TID, 'did row', I
        DO 60 J=1, NCB
          DO 60 K=1, NCA
            C(I,J) = C(I,J) + A(I,K) * B(K,J)
   60 CONTINUE

C     End of parallel region
!$OMP END PARALLEL

C     Print results
      PRINT *, '**************************************************'
      PRINT *, 'Result Matrix:'
      DO 90 I=1, NRA
        DO 80 J=1, NCB
          WRITE(*,70) C(I,J)
   70     FORMAT(2x,f8.2,$)
   80   CONTINUE
        PRINT *, ' '
   90 CONTINUE
      PRINT *, '**************************************************'
      PRINT *, 'Done.'

      END