OpenMP Programming 2: Advanced OpenMP Programming
Berk ONAT, İTÜ Bilişim Enstitüsü
21 June 2012
Outline
OpenMP Prog. & App., 21.06.2012
• OpenMP Synchronization Constructs – Single, Critical, Atomic, Barrier
• OpenMP Data Scope Clauses – Firstprivate, Lastprivate
• Advanced OpenMP Directives – Flush, Threadprivate, Copyin
• Runtime library routines
• OpenMP “Danger Zones” – Race Conditions, Deadlock, Livelock
Synchronization
• Synchronization directives help to organize accesses to shared data by multiple threads
• Types: Barrier, Single, Master, Ordered, Critical, Atomic, Locks
Data Scope Clauses
• The OpenMP Data Scope Attribute Clauses are used to explicitly define how variables should be scoped.
• These constructs provide the ability to control the data environment during execution of parallel constructs.
  – They define how and which data variables in the serial section of the program are transferred to the parallel sections of the program (and back)
  – They define which variables will be visible to all threads in the parallel sections and which variables will be privately allocated to each thread
• Types: Firstprivate, Lastprivate, Reduction, Copyin
Data Scope Clauses (firstprivate)
• FIRSTPRIVATE Clause:
  – Combines the behavior of the PRIVATE clause with automatic initialization of the variables
  – Variables declared firstprivate are private variables
  – Listed variables are initialized to the value of their original objects prior to entry into the parallel or work-sharing construct

C/C++:   firstprivate (list)
Fortran: FIRSTPRIVATE (list)

Example: Check firstprivate.c example code.
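The initialization behavior can be sketched in a few lines of C. This is a minimal sketch, not taken from firstprivate.c: the function name and values are invented, and the pragmas are simply ignored by a non-OpenMP compiler, so the code also runs serially.

```c
/* Each thread's private copy of "offset" starts with the value the
 * original variable had before the construct (here 100); with plain
 * private(offset) the copies would be uninitialized. */
int firstprivate_sum(int n) {
    int offset = 100;
    int total = 0;
    #pragma omp parallel for firstprivate(offset) reduction(+:total)
    for (int i = 0; i < n; i++)
        total += offset + i;   /* offset == 100 in every thread */
    return total;
}
```

firstprivate_sum(10) adds 100 ten times plus 0..9, giving 1045 regardless of the thread count.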
Data Scope Clauses (lastprivate)
• LASTPRIVATE Clause:
  – Combines the behavior of the PRIVATE clause with a copy from the last loop iteration or section to the original variable object
  – A performance penalty is likely to be associated with the use of lastprivate, because the OpenMP library needs to keep track of which thread executes the last iteration. For a static workload distribution scheme this is relatively easy to do, but for a dynamic scheme it is more costly.

C/C++:   lastprivate (list)
Fortran: LASTPRIVATE (list)

Example: Check lastprivate.c example code.
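A minimal sketch of the copy-back behavior (invented function name, not from lastprivate.c):

```c
/* lastprivate(x) copies the value from the sequentially last
 * iteration (i == n-1) back to the original x after the loop. */
int lastprivate_demo(int n) {
    int x = -1;
    #pragma omp parallel for lastprivate(x)
    for (int i = 0; i < n; i++)
        x = i * i;
    return x;               /* (n-1)*(n-1), regardless of thread count */
}
```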
Data Scope Clauses (lastprivate)
• LASTPRIVATE Clause (continued):
  – If the lastprivate clause is used on a sections construct, the object gets assigned the value that it has at the end of the lexically last section.
Data Scope Clauses (reduction)
• REDUCTION Clause:
  – Supports certain forms of recurrence calculations (involving mathematically associative and commutative operators) so that they can be performed in parallel without code modification
  – The result is shared automatically; it is not necessary to declare the reduction variable as shared

C/C++:   reduction (operator : list)
Fortran: REDUCTION (operator : list)

Example: Check reduction.c or reduction.F example code.
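A common use of reduction is a dot product; the sketch below (invented name, not from reduction.c) shows how each thread accumulates a private partial sum that is combined at the end:

```c
/* Each thread accumulates a private copy of s; the partial sums are
 * combined with "+" when the threads finish the loop. */
double dot_product(const double *a, const double *b, int n) {
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}
```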
Data Scope Clauses (reduction)
• REDUCTION operator values (the table of C/C++ and Fortran operators is not reproduced in this transcript)

Example: Check reduction.c example code.
Data Scope Clauses (reduction)

SUBROUTINE REDUCTION(A, B, C, D, X, Y, N)
  REAL    :: X(*), A, D
  INTEGER :: Y(*), N, B, C
  INTEGER :: I
  A = 0
  B = 0
  C = Y(1)
  D = X(1)
!$OMP PARALLEL DO PRIVATE(I) SHARED(X, Y, N) REDUCTION(+:A) &
!$OMP& REDUCTION(IEOR:B) REDUCTION(MIN:C) REDUCTION(MAX:D)
  DO I=1,N
    A = A + X(I)
    B = IEOR(B, Y(I))
    C = MIN(C, Y(I))
    IF (D < X(I)) D = X(I)
  END DO
END SUBROUTINE REDUCTION

Reductions over Fortran intrinsics (MIN, MAX, IEOR, ...) have been supported since OpenMP 2.5; min/max reductions in C/C++ were added in OpenMP 3.1 (July 2011).
Data Scope Clauses (copyin)
• COPYIN Clause:
  – Provides a means for assigning the same value to THREADPRIVATE variables for all threads in the team
  – The list contains the names of variables to copy. In Fortran, the list can contain both the names of common blocks and named variables.
  – The master thread's variable is used as the copy source. The team threads are initialized with its value upon entry into the parallel construct.

C/C++:   copyin (list)
Fortran: COPYIN (list)
Advanced OpenMP Constructs
• THREADPRIVATE Directive:
  – Controls whether global data (static in C/C++, common blocks in Fortran) is private or shared
  – By default global data is shared, but sometimes you need to make it private per thread
  – Supports pointers in C/C++ and Fortran, and ALLOCATABLE variables in Fortran
  – The COPYIN clause can be used to initialize the global data variables

C/C++:   #pragma omp threadprivate (list)
Fortran: !$OMP THREADPRIVATE (/cmn/, list...)

Example: Check threadprivate.c or threadprivate.F example
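The interplay of THREADPRIVATE and COPYIN can be sketched as below. This is a made-up example (not from threadprivate.c); the `#ifdef _OPENMP` guard only makes the sketch compile serially as well.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

static int counter = 0;              /* file-scope: shared by default */
#pragma omp threadprivate(counter)   /* now one copy per thread */

int threadprivate_demo(int start) {
    int master_final = 0;
    counter = start;                 /* sets the master thread's copy */
    #pragma omp parallel copyin(counter)  /* copy master's value to every thread */
    {
        counter += 1;                /* touches only this thread's copy */
        #pragma omp master
        master_final = counter;      /* read back the master's copy */
    }
    return master_final;             /* start + 1 */
}
```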
Advanced OpenMP Constructs
• THREADPRIVATE Directive:
  – Each thread gets its own copy of the variable/common block, so data written by one thread is not visible to other threads
  – Restrictions:
    • For THREADPRIVATE data to persist, the parallel regions must be executed by the same number of threads; each thread then continues to work on the set of data it produced previously
    • The value of the dyn-var internal control variable must be false at entry to the first parallel region and remain false until entry to the second parallel region
Advanced OpenMP Constructs
• PRIVATE vs. THREADPRIVATE

                  PRIVATE                            THREADPRIVATE
Data Item         C/C++: variable                    C/C++: variable
                  Fortran: variable, common block    Fortran: common block
Where Declared    Start of region or                 In declarations of each routine
                  work-sharing group                 using the block, or at file scope
Persistent        NO                                 YES
Initialized by    FIRSTPRIVATE                       COPYIN
Clause/Directives Summary
Runtime Library Routines
• OMP_SET_NUM_THREADS
  – Sets the number of threads that will be used in the next parallel region. Must be a positive integer.
  – Notes:
    • This routine can only be called from the serial portions of the code
    • This call has precedence over the OMP_NUM_THREADS environment variable

C/C++:   #include <omp.h>
         void omp_set_num_threads(int n)
Fortran: USE omp_lib
         SUBROUTINE OMP_SET_NUM_THREADS(N)
Runtime Library Routines
• OMP_GET_NUM_THREADS
  – Returns the number of threads that are currently in the team executing the parallel region from which it is called
  – Notes:
    • If this call is made from a serial portion of the program, or from a nested parallel region that is serialized, it returns 1

C/C++:   #include <omp.h>
         int omp_get_num_threads(void)
Fortran: USE omp_lib
         INTEGER OMP_GET_NUM_THREADS()
Runtime Library Routines
• OMP_GET_MAX_THREADS
  – Returns the maximum value that can be returned by a call to the OMP_GET_NUM_THREADS function
  – Notes:
    • Generally reflects the number of threads as set by the OMP_NUM_THREADS environment variable or the OMP_SET_NUM_THREADS() library routine

C/C++:   #include <omp.h>
         int omp_get_max_threads(void)
Fortran: USE omp_lib
         INTEGER OMP_GET_MAX_THREADS()

Example: Check omp_getEnvInfo.c example code.
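The set/get routines can be exercised with a sketch like the one below (not from omp_getEnvInfo.c; the serial fallbacks are an assumption so the code also builds without an OpenMP compiler):

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* serial stand-ins so the sketch also builds without OpenMP */
static void omp_set_num_threads(int n) { (void)n; }
static int  omp_get_num_threads(void)  { return 1; }
#endif

int report_team_size(void) {
    int team = 0;
    omp_set_num_threads(4);           /* request 4 threads (serial code only) */
    #pragma omp parallel
    {
        #pragma omp single
        team = omp_get_num_threads(); /* actual team size inside the region */
    }
    return team;                      /* 4 if the request was honored; 1 serially */
}
```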
Runtime Library Routines
• OMP_GET_THREAD_NUM
  – Returns the thread number, within the team, of the calling thread. This number is between 0 and OMP_GET_NUM_THREADS()-1. The master thread of the team is thread 0.
  – Notes:
    • If called from a serial region, or from a serialized nested parallel region, this function returns 0

C/C++:   #include <omp.h>
         int omp_get_thread_num(void)
Fortran: USE omp_lib
         INTEGER OMP_GET_THREAD_NUM()
Runtime Library Routines
• OMP_GET_WTIME
  – Provides a portable wall-clock timing routine. Returns seconds in double precision.
  – Usually used in pairs, with the value of the first call subtracted from the value of the second call to obtain the elapsed time for a block of code.

C/C++:   #include <omp.h>
         double omp_get_wtime(void)
Fortran: USE omp_lib
         DOUBLE PRECISION OMP_GET_WTIME()

Example: Check omp_wtime.F example code.
The GNU Fortran compiler implements this routine correctly.
USE: gfortran -fopenmp omp_wtime.F -o omp_wtime.x
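The "pair of calls" pattern looks like this in C. A minimal sketch (invented names, not from omp_wtime.F); the clock()-based fallback is a crude serial stand-in, measuring CPU time rather than wall-clock time:

```c
#ifdef _OPENMP
#include <omp.h>
#else
#include <time.h>
/* crude serial stand-in for omp_get_wtime() */
static double omp_get_wtime(void) { return (double)clock() / CLOCKS_PER_SEC; }
#endif

/* Time a block of code: first call before, second call after,
 * elapsed time is the difference. */
double timed_sum(int n, double *elapsed) {
    double t0 = omp_get_wtime();
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 1; i <= n; i++)
        s += 1.0 / i;
    *elapsed = omp_get_wtime() - t0;  /* seconds spent in the loop */
    return s;
}
```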
Advanced OpenMP Constructs
• FLUSH Directive:
  – If a thread updates shared data,
    • the new values will first be saved in a register
    • then stored back to the local cache
  – The updates are thus not necessarily immediately visible to other threads
  – On a cache-coherent machine, the modification to the cache is broadcast to the other processors to make them aware of the change
  – It depends on the platform!
Advanced OpenMP Constructs
• FLUSH Directive:
  – The OpenMP standard specifies that all modifications are written back to main memory, and are thus available to all threads, at synchronization points in the program
  – Sometimes updated values of shared variables must become visible to other threads in between synchronization points
  – The FLUSH directive is used for this purpose
  – The purpose of the flush directive is to make a thread's temporary view of shared data consistent with the values in memory
Advanced OpenMP Constructs
• FLUSH Directive:
  – Implicit FLUSH operations occur at:
    • All explicit and implicit barriers (e.g., at the end of a parallel region or work-sharing construct)
    • Entry to and exit from critical regions
    • Entry to and exit from lock routines

C/C++:   #pragma omp flush (list)
Fortran: !$OMP FLUSH (list)

Example: Check omp_flush_prod_cons.c example code.
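The producer/consumer pattern that flush enables can be sketched as follows. This is a simplified sketch in the spirit of omp_flush_prod_cons.c, not that file itself; the write order (data before flag) is essential.

```c
/* Producer/consumer handshake between two sections: flush makes the
 * writes to "data" and "flag" visible between the implicit
 * synchronization points. */
int flush_handshake(void) {
    int data = 0, flag = 0, seen = -1;
    #pragma omp parallel sections shared(data, flag, seen)
    {
        #pragma omp section          /* producer */
        {
            data = 42;
            #pragma omp flush(data)  /* publish data before raising the flag */
            flag = 1;
            #pragma omp flush(flag)
        }
        #pragma omp section          /* consumer */
        {
            int f = 0;
            while (!f) {             /* spin until the producer signals */
                #pragma omp flush(flag)
                f = flag;
            }
            #pragma omp flush(data)  /* re-read data after seeing the flag */
            seen = data;
        }
    }
    return seen;
}
```

Compiled without OpenMP the sections simply run one after the other and the handshake trivially succeeds.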
Conditional Parallel Regions
• IF Clause:
  – Used to specify conditional execution
  – Supported on the parallel construct only
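A sketch of the if clause (invented function name): the region is executed by a team only when the condition holds, which is typically used to skip the threading overhead for small problem sizes.

```c
#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_num_threads(void) { return 1; }
#endif

/* For small n the if() condition is false and the region is
 * serialized: a team of exactly one thread executes it. */
int team_for_size(int n) {
    int team = 0;
    #pragma omp parallel if(n > 1000)
    {
        #pragma omp single
        team = omp_get_num_threads();
    }
    return team;
}
```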
OpenMP Danger Zones
• There are three major SMP programming errors:
  – Race Conditions
    • A race condition exists when two unsynchronized threads access the same shared variable, with at least one thread modifying it
  – Deadlock
    • Two or more threads are blocked (hang) forever, waiting for each other
  – Livelock
    • Multiple threads keep working on individual tasks, but the ensemble as a whole cannot finish
OpenMP Danger Zones
• Race Conditions
  – Another common mistake is the use of uninitialized variables. Remember that private variables do not have initial values upon entering a parallel construct. Use the firstprivate and lastprivate clauses to initialize them, but only when necessary, because doing so adds extra overhead.
  – Debug: use the Intel C++ Compiler-specific environment variable KMP_LIBRARY=serial, or simply compile without the -openmp flag.
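A race and its fix can be shown in a few lines (a made-up sketch, not from the lab code): several threads may increment the shared counter at once, so the update must be made indivisible, e.g. with atomic.

```c
/* Counting matches across threads: without the atomic directive,
 * concurrent "count++" updates could be lost (a race condition). */
int count_even(const int *v, int n) {
    int count = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (v[i] % 2 == 0) {
            #pragma omp atomic
            count++;             /* indivisible read-modify-write */
        }
    }
    return count;
}
```

A reduction(+:count) clause would be the cheaper idiom here; atomic is shown because it also works for updates that are not reductions.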
OpenMP Danger Zones
• Global Data:

common /work/ a(m,n), b(m)

...
include "global.h"
...
!$omp parallel do private(j)
do j=1, n
   call suba(j)
end do
!$omp end parallel do

subroutine suba(j)
...
include "global.h"
...
do i=1, m
   b(i) = j            ! race condition: b is in a shared common block
end do
return
end
OpenMP Danger Zones
• Global Data: RACE CONDITION!

Thread 1 executes suba(j=1):
  do i=1, m
     b(i) = 1
  end do

Thread 2 executes suba(j=2):
  do i=1, m
     b(i) = 2
  end do

Both threads modify the same shared array b.
OpenMP Danger Zones
• Global Data: SOLUTION 1

common /work/ a(m,n)
common /tprivate/ b(m,nthreads)

...
include "global.h"
...
!$omp parallel do private(j)
do j=1, n
   call suba(j)
end do
!$omp end parallel do

subroutine suba(j)
...
include "global.h"
TID = omp_get_thread_num() + 1
do i=1, m
   b(i, TID) = j
end do
return
end

Extend b so that each thread reaches its own unique storage area.
OpenMP Danger Zones
• Global Data: SOLUTION 2

common /work/ a(m,n)
common /tprivate/ b(m)
!$omp threadprivate (/tprivate/)

...
include "global.h"
...
!$omp parallel do private(j)
do j=1, n
   call suba(j)
end do
!$omp end parallel do

subroutine suba(j)
...
include "global.h"
...
do i=1, m
   b(i) = j
end do
return
end

The compiler creates a private copy of b for each thread.
OpenMP Danger Zones
• Race Conditions
  – Lab: see racecond.F
    • The result varies unpredictably based on the specific order of execution of each section
    • Wrong answers are produced without warning!
  – Fixed version: racecond-fixed.F
    • An "IC" counter is checked at every calculation; IC forces the order of the calculations, and FLUSH forces the update of the shared variable.
OpenMP Danger Zones
• Deadlock
  – Two or more threads are blocked forever, each waiting for a resource (such as a lock or a barrier) that the other blocked threads hold or will never reach
  – Types:
    • 'potential deadlock': a deadlock that did not occur in a given run, but can occur in different runs of the program depending on the timing of the threads' lock requests
    • 'actual deadlock': one that actually occurred in a given run of the program. An actual deadlock causes the threads involved to hang, but may or may not cause the whole process to hang.
OpenMP Danger Zones
• Deadlock

#pragma omp parallel
{
    int me = omp_get_thread_num();
    if (me == 0) goto MASTER;
#pragma omp barrier
MASTER:
#pragma omp single
    printf("done");
}

In this example deadlock occurs because the threads arrive at different barriers: thread 0 skips the explicit barrier that the other threads wait at. A thread skipping a barrier generally causes deadlock. Nested CRITICAL sections or LOCK routines can also cause deadlock.
OpenMP Danger Zones
• Livelock
  – A livelock is similar to a deadlock, except that the states of the threads involved constantly change with regard to one another, with none progressing

!$OMP PARALLEL PRIVATE(ID)
      ID = OMP_GET_THREAD_NUM()
      N = OMP_GET_NUM_THREADS()
 1000 CONTINUE
      PHASES(ID) = UPDATE(U, ID)
!$OMP SINGLE
      RES = MATCH(PHASES, N)
!$OMP END SINGLE
      IF (RES**2 .LT. TOL) GOTO 2000
      GOTO 1000
 2000 CONTINUE
!$OMP END PARALLEL

If the square of RES is never smaller than TOL, the program spins endlessly in livelock.
OpenMP Death Traps
• Are you using thread-safe libraries?
• I/O inside a parallel region can interleave unpredictably.
• Make sure you understand what your constructors are doing with private objects.
• Private variables can mask global ones.
• Understand when shared memory is coherent. When in doubt, use FLUSH.
• NOWAIT removes implied barriers.
OpenMP 3.0: task clause!
• TASK Directive:
  – Runs a defined subroutine, function, or code block marked with omp task as an independent unit of work
  – Before standardization, tasking was vendor specific (Intel taskq)
  – The task directive is part of the OpenMP 3.0 standard (the Intel 10.1 and GCC 4.4 compilers implemented task in late 2008)

C/C++:   #pragma omp task
Fortran: !$OMP TASK

Example: Check omp_linked_list.c and omp_task_omp3.c example code.
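The classic tasking sketch is a recursive Fibonacci: each recursive call is spawned as a task and taskwait joins the children. This is a generic illustration, not omp_task_omp3.c; with a pre-3.0 compiler the pragmas are ignored and the code runs as plain serial recursion.

```c
int fib(int n) {
    int a, b;
    if (n < 2) return n;
    #pragma omp task shared(a)
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait      /* wait for the two child tasks */
    return a + b;
}

int fib_parallel(int n) {
    int result = 0;
    #pragma omp parallel
    #pragma omp single        /* one thread seeds the task tree */
    result = fib(n);
    return result;
}
```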
OpenMP 3.0: Intel taskq

C/C++:   #pragma intel omp task
Fortran: !$INTEL OMP TASKQ

Example: Check omp_task_intel.c example code.

bash$ icc omp_task_intel.c -o omp_task_intel.x -openmp -openmp-task intel
TASK BARRIERS: taskwait
*Ref. 4
TASK BARRIERS: taskgroup
*Ref. 4
TASK SYNCHRONIZATION: taskyield
*Ref. 4
Task, Workshare or Nested ?
*Ref. 4
Alignment Evaluation Program
Task, Workshare or Nested ?
*Ref. 4
SparseLU Program
Lab: Exercise 1

Write correct OpenMP pragmas to parallelize the serial matrix multiplication code.

$ ../openmp-application/matrixmultp.c
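One possible shape of the answer (a sketch, not the contents of matrixmultp.c; row-major flat arrays are an assumption):

```c
/* Row-parallel matrix multiply of square n x n matrices: each thread
 * computes a disjoint set of rows of C, so no synchronization is needed. */
void matmul(int n, const double *A, const double *B, double *C) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i*n + k] * B[k*n + j];
            C[i*n + j] = s;
        }
}
```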
Lab: Exercise 2

Calculate:

  ∫₀¹ 4/(1+x²) dx ≈ Σᵢ₌₁..ɴ h · 4/(1+xᵢ²) = Σᵢ₌₁..ɴ h · 4/(1+(h(i−1/2))²),  with h = 1/N

1. Write a simple serial code first
2. Implement OpenMP pragmas in your code
3. Are your threads synchronized?
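The midpoint-rule sum above translates directly to a loop; a possible parallel sketch (the integral evaluates to π, and the reduction keeps the partial sums race-free):

```c
/* Midpoint rule for the integral above: x_i = h*(i - 1/2), h = 1/n. */
double compute_pi(int n) {
    double h = 1.0 / n;
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++) {
        double x = h * (i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    return h * sum;   /* approaches pi as n grows */
}
```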
Lab: Exercise 3

Write an A=LU decomposition code. Back substitution is not necessary for now; just parallelize the loops given below.
Initialize the matrix with any numbers (e.g., A[i][j] = 1.0 + (i*size) + j).
Print L and A after the LU decomposition (A will be the new U).
Can you implement partial pivoting and scaling?

LU-factorization algorithm:
for k=1 to n-1
   for i=k+1 to n
      L(i,k) = A(i,k) / A(k,k)
      for j=k+1 to n
         A(i,j) = A(i,j) - L(i,k)*A(k,j)
      end for
   end for
end for
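The pseudocode above maps to C as follows (a sketch with flat row-major arrays and no pivoting, as the exercise allows): for a fixed k the updates of different rows i are independent, so the i-loop can be a parallel for.

```c
/* kij-form LU factorization: A is overwritten with U in its upper
 * triangle; the multipliers go into L. No pivoting. */
void lu_factor(int n, double *A, double *L) {   /* n x n, row-major */
    for (int k = 0; k < n - 1; k++) {
        #pragma omp parallel for
        for (int i = k + 1; i < n; i++) {       /* rows are independent */
            L[i*n + k] = A[i*n + k] / A[k*n + k];
            for (int j = k + 1; j < n; j++)
                A[i*n + j] -= L[i*n + k] * A[k*n + j];
        }
    }
}
```

Note that the outer k-loop cannot be parallelized: step k+1 depends on the updates of step k.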
References:
1. Blaise Barney, OpenMP Tutorial, https://computing.llnl.gov/tutorials/openMP/
2. Rohit Chandra, Parallel Programming in OpenMP, 2000, Morgan Kaufmann
3. Michael J. Quinn, Parallel Programming in C with MPI and OpenMP, 2003, McGraw-Hill
4. Eduard Ayguadé, Alejandro Duran, Jay Hoeflinger, Federico Massaioli and Xavier Teruel, An Experimental Evaluation of the New OpenMP Tasking Model, Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science, 2008, Volume 5234/2008, 63-77