260 // Computing 12.1 Motivation
12 Parallel Computing
12.1 Motivation

Why do we need ever more computing power?

Much-cited legend: in 1947 computer engineer Howard Aiken said that the USA would need at most 6 computers in the future!
Gordon Moore's (co-founder of Intel) law:
(1965: the number of transistors doubles every second year)
1975 refinement of the above: the number of transistors on a CPU doubles every 18 months.

Continuing this way, by 2020 or 2030 we would reach the atomic level (or the quantum computer).
Flops:

• first computers: 10^2 Flops (100 Flops)
• desktop computers today: 10^9 Flops (Gigaflops, GFlops)
• supercomputers nowadays: 10^12 Flops (Teraflops, TFlops)
• the aim of today: 10^15 Flops (Petaflops, PFlops)
• next step: 10^18 Flops (Exaflops, EFlops)
• How can parallel processing help to achieve these goals?
• Why should we talk about it in a Scientific Computing course?
Example 12.1. Why are speeds of order 10^12 flops not enough?

Weather forecast for Europe for the next 48 hours, from sea level up to 20 km:

• we need to solve a PDE in 3D (x, y, z and t);
• the volume of the region: 5000 ∗ 4000 ∗ 20 km^3;
• stepsize in the xy-plane: 1 km; Cartesian mesh in the z-direction: 0.1 km; timesteps: 1 min;
• around 1000 flop per mesh point per timestep.

As a result, ca.

5000 ∗ 4000 ∗ 20 km^3 × 10 mesh points per km^3 = 4 ∗ 10^9 mesh points

and

4 ∗ 10^9 mesh points × 48 ∗ 60 timesteps × 10^3 flop ≈ 1.15 ∗ 10^16 flop.

If a computer can do 10^9 flops, it will take around

1.15 ∗ 10^16 flop / 10^9 flops = 1.15 ∗ 10^7 seconds ≈ 133 days!!

But with 10^12 flops,

1.15 ∗ 10^4 seconds ≈ 3.2 hours.

It is not hard to imagine "small" changes to the given scheme such that even 10^12 flops is not enough:

• Reduce the mesh stepsize to 0.1 km in the x- and y-directions and the timestep to 10 seconds: the work grows by a factor of 600, from about 3 hours to about 80 days.
• Replace the Europe model with a whole-World model (the area grows from 2 ∗ 10^7 km^2 to 5 ∗ 10^8 km^2) and combine it with an Ocean model.
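The estimate of Example 12.1 is easy to reproduce; the following short script (our own, using the numbers stated above) carries out the arithmetic, and shows that the exact product is ≈ 1.15 ∗ 10^16 flop:

```python
# Flop-count estimate for the 48-hour European weather forecast (Example 12.1)
area_columns = 5000 * 4000      # 1 km stepsize in the xy-plane -> one column per km^2
layers = 200                    # 20 km height / 0.1 km vertical stepsize
mesh_points = area_columns * layers

timesteps = 48 * 60             # 1-minute timesteps for 48 hours
flop_per_point = 1000           # assumed work per mesh point per timestep

total_flop = mesh_points * timesteps * flop_per_point
print(f"mesh points: {mesh_points:.2e}")                      # 4.00e+09
print(f"total flop:  {total_flop:.2e}")                       # 1.15e+16
print(f"at 10^9  flop/s: {total_flop / 1e9 / 86400:.0f} days")
print(f"at 10^12 flop/s: {total_flop / 1e12 / 3600:.1f} hours")
```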
Therefore we must say: the needs of science and technology grow faster than the available possibilities; we need only change ε and h to be unsatisfied again!

But again, why do we need a parallel computer to achieve this goal?
Example 12.2. Have a look at the following piece of code:

    do i = 1, 1000000000000
       z(i) = x(i) + y(i)   ! i.e. 3 ∗ 10^12 memory accesses
    end do

Assuming that data travels at the speed of light, 3 ∗ 10^8 m/s, to finish the operation in 1 second the average distance of a memory cell must be

r = (3 ∗ 10^8 m/s ∗ 1 s) / (3 ∗ 10^12) = 10^-4 m.

Typical computer memory is laid out as a rectangular mesh. The length of each edge would then be

(2 ∗ 10^-4 m) / √(3 ∗ 10^12) ≈ 10^-10 m,

i.e. the size of a small atom!
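The two numbers in Example 12.2 follow directly from the assumptions; a quick check:

```python
import math

# Speed-of-light bound on average memory-cell distance (Example 12.2)
c = 3e8           # speed of light, m/s
accesses = 3e12   # 10^12 iterations, 3 memory accesses each
t = 1.0           # target: finish the loop in one second

r = c * t / accesses                 # average distance to a memory cell
print(f"average distance r = {r:.1e} m")       # 1.0e-04

side = 2 * r                         # memory laid out in a square of side 2r
spacing = side / math.sqrt(accesses) # cell spacing in a square mesh of 3*10^12 cells
print(f"cell spacing = {spacing:.2e} m")       # ~1.15e-10, the size of a small atom
```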
Why is parallel computing not yet the only way of doing it? Three main reasons: hardware, algorithms and software.

Hardware: the speed of
• networking
• peripheral devices
• memory access
does not keep up with the growth in processor capacity.

Algorithms: an enormous number of different algorithms exist, but
• problems start with their implementation in some real-life application.

Software: development in progress;
• no good enough automatic parallelisers
• everything is done as handwork
• no easily portable libraries for different architectures
• does the parallelising effort pay off at all?
12.2 Classification of Parallel Computers
• Architecture
– single-processor computer
– multicore processor
– distributed system
– shared memory system
• Network topology
– ring, array, hypercube, . . .
• Memory access
– shared, distributed, hybrid
• Operating system
– UNIX
– LINUX
– (OpenMosix)
– WIN*
• Scope
– supercomputing
– distributed computing
– real-time systems
– mobile systems
– grid and cloud computing
– etc.
But it is impossible to ignore the (implicit or explicit) parallelism in a computer or a set of computers.

A list of the top parallel computer systems: http://www.top500.org.
12.3 Parallel Computational Models
• Data parallelism: e.g. vector processors
• Control parallelism: e.g. shared memory
– Each process has access to all data. Access to this data is coordinated by some form of locking procedure.
• Message passing: e.g. MPI
– Independent processes with independent memory. Communication between the processes via sending messages.
• Other models:
– Hybrid versions of the above models,
– remote memory operations (“one-sided” sends and receives)
13 Parallel programming models
Many different models exist for expressing parallelism in programming languages:
• Actor model
– Erlang
– Scala
• Coordination languages
– Linda
• CSP-based (Communicating Sequential Processes)
– FortranM
– Occam
• Dataflow
– SISAL (Streams and Iteration in a Single Assignment Language)
• Distributed
– Sequoia
– Bloom
• Event-driven and hardware description
– Verilog hardware description language (HDL)
• Functional
– Concurrent Haskell
• GPU languages
– CUDA
– OpenCL
– OpenACC
– OpenHMPP (HMPP for Hybrid Multicore Parallel Programming)
• Logic programming
– Parlog
• Multi-threaded
– Clojure
• Object-oriented
– Charm++
– Smalltalk
• Message-passing
– MPI
– PVM
• Partitioned global address space (PGAS)
– High Performance Fortran (HPF)
13.1 HPF
A partitioned global address space parallel programming model; an extension of Fortran90.
• SPMD (Single Program Multiple Data) model
• each process operates with its own part of data
• HPF commands specify which processor gets which part of the data
• Concurrency is defined by HPF commands based on Fortran90
• HPF directives are written as comments: !HPF$ <directive>
HPF example: matrix multiplication

    PROGRAM ABmult
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 100
      INTEGER, DIMENSION(N,N) :: A, B, C
      INTEGER :: i, j
    !HPF$ PROCESSORS square(2,2)
    !HPF$ DISTRIBUTE (BLOCK,BLOCK) ONTO square :: C
    !HPF$ ALIGN A(i,*) WITH C(i,*)
    !     replicate copies of row A(i,*) onto processors which compute C(i,j)
    !HPF$ ALIGN B(*,j) WITH C(*,j)
    !     replicate copies of column B(*,j) onto processors which compute C(i,j)
      A = 1
      B = 2
      C = 0
      DO i = 1, N
        DO j = 1, N
    !     All the work is local due to the ALIGNs
          C(i,j) = DOT_PRODUCT(A(i,:), B(:,j))
        END DO
      END DO
      WRITE(*,*) C
    END
HPF programming methodology

• Need to find a balance between concurrency and communication
• the more processes, the more communication
• aiming at
– balanced load, based on the owner-computes rule
– data locality

It is easy to write a program in HPF but difficult to gain good efficiency. Programming in HPF goes more or less like this:
1. Write a correctly working serial program; test and debug it.
2. Add distribution directives, introducing as little communication as possible.
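The (BLOCK,BLOCK) distribution used in the matrix-multiplication example is easy to emulate. The following sketch (our own helper function, not HPF syntax) computes which processor of a 2×2 grid owns element C(i,j) of a 100×100 matrix, i.e. which processor computes it under the owner-computes rule:

```python
# Which processor of a P x Q grid owns C(i, j) under (BLOCK, BLOCK) distribution?
def block_owner(i, j, n=100, m=100, P=2, Q=2):
    """Return (p, q) grid coordinates of the owner of C(i, j); 0-based indices."""
    bi = -(-n // P)        # block size in rows, ceil(n / P)
    bj = -(-m // Q)        # block size in columns, ceil(m / Q)
    return (i // bi, j // bj)

print(block_owner(0, 0))     # (0, 0): top-left block
print(block_owner(99, 99))   # (1, 1): bottom-right block
print(block_owner(10, 60))   # (0, 1)
```

With the ALIGN directives of the example, the needed row of A and column of B are replicated onto the same processor, which is exactly why all the work in the inner loop is local.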
13.2 OpenMP
OpenMP tutorial (http://www.llnl.gov/computing/tutorials/openMP/)
• Programming model based on thread parallelism
• A helping tool for the programmer
• Based on compiler directives (C/C++, Fortran)
• Nested parallelism is supported (though not by all implementations)
• Dynamic threads
• OpenMP has become a standard
• Fork-Join model
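In the fork-join model a master thread forks a team of threads at the start of a parallel region, and all of them join again before sequential execution continues. As an illustration only (OpenMP manages the team via compiler directives; here Python threads stand in for it):

```python
# Fork-join sketch: master forks a team of threads, all join before continuing
from threading import Thread

def work(tid, results):
    results[tid] = tid * tid      # each thread's share of the work

nthreads = 4
results = [None] * nthreads
team = [Thread(target=work, args=(t, results)) for t in range(nthreads)]
for t in team:
    t.start()                     # fork: parallel region begins
for t in team:
    t.join()                      # join: barrier at the end of the parallel region
print(results)                    # [0, 1, 4, 9]
```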
OpenMP example: matrix multiplication

    C https://www.llnl.gov/computing/tutorials/openMP/exercise.html
    C*************************************************************************
    C OpenMP Example - Matrix Multiply - Fortran Version
    C FILE: omp_mm.f
    C DESCRIPTION:
    C   Demonstrates a matrix multiply using OpenMP. Threads share row
    C   iterations according to a predefined chunk size.
    C LAST REVISED: 1/5/04  Blaise Barney
    C*************************************************************************

          PROGRAM MATMULT
          INTEGER NRA, NCA, NCB, TID, NTHREADS, I, J, K, CHUNK,
         +        OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
    C     number of rows in matrix A
          PARAMETER (NRA=62)
    C     number of columns in matrix A
          PARAMETER (NCA=15)
    C     number of columns in matrix B
          PARAMETER (NCB=7)
          REAL*8 A(NRA,NCA), B(NCA,NCB), C(NRA,NCB)

    C     Set loop iteration chunk size
          CHUNK = 10

    C     Spawn a parallel region explicitly scoping all variables
    !$OMP PARALLEL SHARED(A,B,C,NTHREADS,CHUNK) PRIVATE(TID,I,J,K)
          TID = OMP_GET_THREAD_NUM()
          IF (TID .EQ. 0) THEN
            NTHREADS = OMP_GET_NUM_THREADS()
            PRINT *, 'Starting matrix multiple example with', NTHREADS,
         +           'threads'
            PRINT *, 'Initializing matrices'
          END IF

    C     Initialize matrices
    !$OMP DO SCHEDULE(STATIC, CHUNK)
          DO 30 I=1, NRA
            DO 30 J=1, NCA
              A(I,J) = (I-1)+(J-1)
       30 CONTINUE
    !$OMP DO SCHEDULE(STATIC, CHUNK)
          DO 40 I=1, NCA
            DO 40 J=1, NCB
              B(I,J) = (I-1)*(J-1)
       40 CONTINUE
    !$OMP DO SCHEDULE(STATIC, CHUNK)
          DO 50 I=1, NRA
            DO 50 J=1, NCB
              C(I,J) = 0
       50 CONTINUE

    C     Do matrix multiply sharing iterations on outer loop
    C     Display who does which iterations for demonstration purposes
          PRINT *, 'Thread', TID, 'starting matrix multiply...'
    !$OMP DO SCHEDULE(STATIC, CHUNK)
          DO 60 I=1, NRA
            PRINT *, 'Thread', TID, 'did row', I
            DO 60 J=1, NCB
              DO 60 K=1, NCA
                C(I,J) = C(I,J) + A(I,K)*B(K,J)
       60 CONTINUE
    C     End of parallel region
    !$OMP END PARALLEL

    C     Print results
          PRINT *, '****************************************************'
          PRINT *, 'Result Matrix:'
          DO 90 I=1, NRA
            DO 80 J=1, NCB
              WRITE(*,70) C(I,J)
       70     FORMAT(2x,f8.2,$)
       80   CONTINUE
            PRINT *, ' '
       90 CONTINUE
          PRINT *, '****************************************************'
          PRINT *, 'Done.'
          END
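The SCHEDULE(STATIC, CHUNK) clause in the example hands out iterations in fixed-size chunks, assigned round-robin to the threads. Which thread gets which row can be sketched as follows (our own helper, mirroring the static-with-chunk rule):

```python
# Static chunked scheduling: row i goes to thread (i // chunk) mod nthreads
def static_chunk_owner(i, chunk=10, nthreads=4):
    return (i // chunk) % nthreads

rows = 62                      # NRA in the example above (0-based here)
schedule = {}
for i in range(rows):
    schedule.setdefault(static_chunk_owner(i), []).append(i)
print(schedule[0])             # thread 0: rows 0-9, then 40-49
print(schedule[2])             # thread 2: rows 20-29, then 60-61
```

This is why each thread in the example prints whole runs of consecutive rows: the chunk size, not the loop order, determines the assignment.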